[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n.hypothesis/\n.pytest_cache/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# pyenv\n.python-version\n\n# celery beat schedule file\ncelerybeat-schedule\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n\n# local environment stuff\n.DS_Store\ndata\n.vscode\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) [2019] [textpack]\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE."
  },
  {
    "path": "Pipfile",
    "content": "[[source]]\nname = \"pypi\"\nurl = \"https://pypi.org/simple\"\nverify_ssl = true\n\n[dev-packages]\npep8 = \"*\"\npylint = \"*\"\n\n[packages]\npandas = \"*\"\nsklearn = \"*\"\nscipy = \"*\"\nnumpy = \"*\"\ncython = \"*\"\nsparse-dot-topn = \"*\"\nkaggle = \"*\"\n\n[requires]\npython_version = \"3.7\"\n"
  },
  {
    "path": "README.md",
    "content": "## What is this?\n\nTextPack efficiently groups similar values in large (or small) datasets. Under the hood, it builds a document term matrix of n-grams assigned a TF-IDF score. It then uses matrix multiplication to calculate the cosine similarity between these values. For a technical explination, [I wrote a blog post](https://medium.com/p/2493b3ce6d8d).\n\n## Why do I care?\n\nIf you're a analyst, journalist, data scientist or similar and have ever had a spreadsheet, SQL table or JSON string filled with inconsistent inputs like this:\n\n| row |     fullname      |\n|-----|-------------------|\n|   1 | John F. Doe       |\n|   2 | Esquivel, Mara    |\n|   3 | Doe, John F       |\n|   4 | Whyte, Luke       |\n|   5 | Doe, John Francis |\n\nAnd you want to perform some kind of analysis – perhaps in a Pivot Table or a Group By statement – but are hindered by the deviations in spelling and formatting, you can use TextPack to comb thousands of cells in seconds and create a third column like this:\n\n| row |     fullname      |  name_groups  |\n|-----|-------------------|---------------|\n|   1 | John F. Doe       | Doe John F    |\n|   2 | Esquivel, Mara    | Esquivel Mara |\n|   3 | Doe, John F       | Doe John F    |\n|   4 | Whyte, Luke       | Whyte Luke    |\n|   5 | Doe, John Francis | Doe John F    |\n\nWe can then group by `name_groups` and perform our analysis. \n\nYou can also group across multiple columns. 
For instance, given the following:\n\n| row |  make  |   model   |\n|-----|--------|-----------|\n|   1 | Toyota | Camry     |\n|   2 | toyta  | camry DXV |\n|   3 | Ford   | F-150     |\n|   4 | Toyota | Tundra    |\n|   5 | Honda  | Accord    |\n\nYou can group across `make` and `model` to create:\n\n| row |  make  |   model   |  car_groups  |\n|-----|--------|-----------|--------------|\n|   1 | Toyota | Camry     | toyotacamry  |\n|   2 | toyta  | camry DXV | toyotacamry  |\n|   3 | Ford   | F-150     | fordf150     |\n|   4 | Toyota | Tundra    | toyotatundra |\n|   5 | Honda  | Accord    | hondaaccord  |\n\n## How do I use it?\n\n#### Installation\n\n```\npip install textpack\n```\n\n#### Import module\n\n```\nfrom textpack import tp\n```\n\n#### Instantiate TextPack\n\n```\ntp.TextPack(df, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\nClass parameters:\n\n - `df` (required): A Pandas DataFrame containing the dataset to group\n - `columns_to_group` (required): A list or string matching the column header(s) you'd like to parse and group\n - `match_threshold` (optional): A floating point number between 0 and 1 representing the cosine similarity threshold we'll use to determine if two strings should be grouped. The closer the threshold is to 1, the more similar two strings must be to count as a match.\n - `ngram_remove` (optional): A regular expression you can use to filter characters out of your strings when we build our n-grams.\n - `ngram_length` (optional): The length of our n-grams. This can be used in tandem with `match_threshold` to find the sweet spot for grouping your dataset. If TextPack is running slowly, it's usually a sign to consider raising the n-gram length.\n\nTextPack can also be instantiated using the following helpers, each of which is just a wrapper that converts a data format to a Pandas DataFrame and then passes it to TextPack. 
Thus, they all require a file path and `columns_to_group`, and take the same optional parameters as calling `TextPack` directly (`read_excel` additionally accepts a `sheet_name`).\n\n```\ntp.read_csv(csv_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\n```\ntp.read_excel(excel_path, columns_to_group, sheet_name=None, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\n```\ntp.read_json(json_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\n#### Run TextPack and group values\n\nTextPack objects have the following public properties:\n\n - `df`: The DataFrame used internally by TextPack – manipulate as you see fit\n - `group_lookup`: A Python dictionary built by `build_group_lookup` and then used by `add_grouped_column_to_data` to look up each value that has a group. It looks like this:\n\n```\n{\n    'John F. Doe': 'Doe John F',\n    'Doe, John F': 'Doe John F',\n    'Doe, John Francis': 'Doe John F'\n}\n```\n\nTextPack objects also have the following public methods:\n\n - `build_group_lookup()`: Runs the cosine similarity analysis and builds `group_lookup`.\n - `add_grouped_column_to_data(column_name='Group')`: Uses vectorization to map values to groups via `group_lookup` and adds the new column to the DataFrame. 
The column header can be set via `column_name`.\n - `set_match_threshold(match_threshold)`: Modify the match threshold internally.\n - `set_ngram_remove(ngram_remove)`: Modify the n-gram regex filter internally.\n - `set_ngram_length(ngram_length)`: Modify the n-gram length internally.\n - `run(column_name='Group')`: A helper function that calls `build_group_lookup` followed by `add_grouped_column_to_data`.\n\n#### Export our grouped dataset\n\n - `export_json(export_path)`\n - `export_csv(export_path)`\n\n#### A simple example\n\n```\nfrom textpack import tp\n\ncars = tp.read_csv('./cars.csv', ['make', 'model'], match_threshold=0.8, ngram_length=5)\n\ncars.run()\n\ncars.export_csv('./cars-grouped.csv')\n```\n\n## Troubleshooting\n\n#### I'm getting a Memory Error!\n\nSome users have triggered memory errors when parsing big datasets. [This StackOverflow post](https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type) has proved useful.\n\n## How does it work?\n\nAs mentioned above, under the hood, we're building a document term matrix of n-grams assigned a TF-IDF score. We're then using matrix multiplication to quickly calculate the cosine similarity between these values.\n\nI wrote [this blog post](https://medium.com/p/2493b3ce6d8d) to explain how TextPack works behind the scenes. Check it out!\n"
  },
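  {
    "path": "examples/group_names.py",
    "content": "\"\"\"A minimal usage sketch of the TextPack API documented in the README.\n\nThis is a hypothetical example file (not part of the published package): it\nbuilds a small DataFrame of inconsistently formatted names, groups them with\nTextPack, and prints the grouped frame. Assumes textpack and its dependencies\nare installed.\n\"\"\"\nimport pandas as pd\n\nfrom textpack import tp\n\n# Inconsistent inputs, as in the README's first table\ndf = pd.DataFrame({'fullname': [\n    'John F. Doe',\n    'Esquivel, Mara',\n    'Doe, John F',\n    'Whyte, Luke',\n    'Doe, John Francis',\n]})\n\n# A single column can be passed as a plain string\npack = tp.TextPack(df, 'fullname', match_threshold=0.75, ngram_length=3)\n\n# run() builds the group lookup and adds the grouped column in one call\npack.run(column_name='name_groups')\n\nprint(pack.df[['fullname', 'name_groups']])\n"
  },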
  {
    "path": "run_tests.sh",
    "content": "if [ ! -f textpack/tests/craigslistVehicles.csv ]; then # state files\n  kaggle datasets download -d austinreese/craigslist-carstrucks-data -p textpack/tests\n  unzip -o textpack/tests/craigslist-carstrucks-data.zip -d textpack/tests\n  chmod 644 textpack/tests/craigslistVehicles.csv \n  rm -rf textpack/tests/craigslistVehiclesFull.csv textpack/tests/craigslist-carstrucks-data.zip\nfi\n\npython -m unittest discover -s textpack/tests\n\n# Delete the craigslistVehicles file (1.1GB)\nrm -rf textpack/tests/craigslistVehicles.csv\n"
  },
  {
    "path": "setup.py",
    "content": "import setuptools\n\nwith open(\"README.md\", \"r\") as fh:\n    long_description = fh.read()\n\nsetuptools.setup(\n    name=\"textpack\",\n    version=\"0.0.5\",\n    author=\"Luke Whyte\",\n    author_email=\"lukeawhyte@gmail.com\",\n    description=\"Quickly identify and group similar text strings in a large dataset\",\n    long_description=long_description,\n    long_description_content_type=\"text/markdown\",\n    url=\"https://github.com/lukewhyte/textpack\",\n    packages=setuptools.find_packages(),\n    install_requires=[\n        'pandas',\n        'sklearn',\n        'scipy',\n        'numpy',\n        'cython',\n        'sparse-dot-topn'\n    ],\n    classifiers=[\n        \"Programming Language :: Python :: 3\",\n        \"License :: OSI Approved :: MIT License\",\n        \"Operating System :: OS Independent\",\n    ],\n)\n"
  },
  {
    "path": "textpack/__init__.py",
    "content": "name = \"textpack\""
  },
  {
    "path": "textpack/tests/test_tp.py",
    "content": "import unittest\nimport json\n\nfrom textpack import tp\n\n\nclass TestTextPack(unittest.TestCase):\n    def test_build_group_index(self):\n        cl_vehicles = tp.read_csv('./textpack/tests/craigslistVehicles.csv', ['manufacturer', 'make'])\n        cl_vehicles.build_group_lookup()\n\n        expected = 'toyotacamry'\n        result = cl_vehicles.group_lookup['toyotacamry dx v6']\n\n        self.assertEqual(expected, result)\n\n    def test_add_grouped_column_to_data(self):\n        cl_vehicles = tp.read_csv('./textpack/tests/craigslistVehicles.csv', ['manufacturer', 'make'])\n\n        cl_vehicles.build_group_lookup()\n        cl_vehicles.add_grouped_column_to_data()\n\n        expected = 443405\n        result = cl_vehicles.df['textpackGrouper'].shape[0]\n\n        self.assertEqual(expected, result)\n\n    def test_export_json(self):\n        cl_vehicles = tp.read_csv('./textpack/tests/craigslistVehicles.csv', ['manufacturer', 'make'])\n\n        cl_vehicles.run()\n\n        expected = 'https://youngstown.craigslist.org/ctd/d/youngstown-2018-kia-forte-lx-sedan-4d/6886405952.html'\n\n        print('Loading JSON, hold fast...')\n        result = json.loads(cl_vehicles.export_json())['url']['439513']\n\n        self.assertEqual(expected, result)\n"
  },
  {
    "path": "textpack/tp.py",
    "content": "import re\nimport numpy as np\nimport pandas as pd\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sparse_dot_topn import awesome_cossim_topn\n\n\nclass TextPack():\n    def __init__(self, df, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3):\n        self.df = df\n        self.group_lookup = {}\n        self._column = self._get_column(columns_to_group)\n        self._match_threshold = match_threshold\n        self._ngram_remove = ngram_remove\n        self._ngram_length = ngram_length\n\n    def _get_column(self, columns_to_group):\n        if ''.join(columns_to_group) in self.df.columns:\n            return ''.join(columns_to_group)\n        else:\n            self.df['textpackGrouper'] = self.df[columns_to_group.pop(0)].astype(str).str.cat(self.df[columns_to_group].astype(str))\n            return 'textpackGrouper'\n\n    def _ngrams_analyzer(self, string):\n        string = re.sub(self._ngram_remove, r'', string)\n        ngrams = zip(*[string[i:] for i in range(self._ngram_length)])\n        return [''.join(ngram) for ngram in ngrams]\n\n    def _get_tf_idf_matrix(self, vals):\n        vectorizer = TfidfVectorizer(analyzer=self._ngrams_analyzer)\n        return vectorizer.fit_transform(vals)\n\n    def _get_cosine_matrix(self, vals):\n        tf_idf_matrix = self._get_tf_idf_matrix(vals)\n        return awesome_cossim_topn(tf_idf_matrix, tf_idf_matrix.transpose(), vals.size, self._match_threshold)\n\n    def _find_group(self, y, x):\n        if y in self.group_lookup:\n            return self.group_lookup[y]\n        elif x in self.group_lookup:\n            return self.group_lookup[x]\n        else:\n            return None\n\n    def _add_vals_to_lookup(self, group, y, x):\n        self.group_lookup[y] = group\n        self.group_lookup[x] = group\n\n    def _add_pair_to_lookup(self, row, col):\n        group = self._find_group(row, col)\n        if group is not None:\n            
self._add_vals_to_lookup(group, row, col)\n        else:\n            self._add_vals_to_lookup(row, row, col)\n\n    def set_ngram_remove(self, ngram_remove):\n        self._ngram_remove = ngram_remove\n\n    def set_ngram_length(self, ngram_length):\n        self._ngram_length = ngram_length\n\n    def set_match_threshold(self, match_threshold):\n        self._match_threshold = match_threshold\n\n    def build_group_lookup(self):\n        vals = self.df[self._column].unique().astype('U')\n\n        print('Building the TF-IDF, Cosine & Coord matrices...')\n        coord_matrix = self._get_cosine_matrix(vals).tocoo()\n\n        print('Building the group lookup...')\n        for row, col in zip(coord_matrix.row, coord_matrix.col):\n            if row != col:\n                self._add_pair_to_lookup(vals[row], vals[col])\n\n    def add_grouped_column_to_data(self, column_name='Group'):\n        print('Adding grouped columns to data frame...')\n        self.df[column_name] = self.df[self._column].map(self.group_lookup).fillna(self.df[self._column])\n\n    def run(self, column_name='Group'):\n        self.build_group_lookup()\n        self.add_grouped_column_to_data(column_name)\n        print('Ready for export')\n\n    def _filter_df_for_export(self):\n        return self.df.drop(columns=['textpackGrouper']) if 'textpackGrouper' in self.df.columns else self.df\n\n    def export_json(self, export_path=None):\n        return self._filter_df_for_export().to_json(export_path)\n\n    def export_csv(self, export_path=None):\n        return self._filter_df_for_export().to_csv(export_path)\n\n\ndef read_json(json_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3):\n    return TextPack(pd.read_json(json_path), columns_to_group, match_threshold, ngram_remove, ngram_length)\n\n\ndef read_excel(excel_path, columns_to_group, sheet_name=None, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3):\n    return 
TextPack(pd.read_excel(excel_path, sheet_name=0 if sheet_name is None else sheet_name), columns_to_group, match_threshold, ngram_remove, ngram_length)\n\n\ndef read_csv(csv_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3):\n    return TextPack(pd.read_csv(csv_path), columns_to_group, match_threshold, ngram_remove, ngram_length)\n"
  }
]