Full Code of lukewhyte/textpack for AI

master 45e94cfde654 cached
9 files
14.8 KB
4.0k tokens
25 symbols
Repository: lukewhyte/textpack
Branch: master
Commit: 45e94cfde654
Files: 9
Total size: 14.8 KB

Directory structure:
gitextract_6qdfimih/

├── .gitignore
├── LICENSE
├── Pipfile
├── README.md
├── run_tests.sh
├── setup.py
└── textpack/
    ├── __init__.py
    ├── tests/
    │   └── test_tp.py
    └── tp.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

# local environment stuff
.DS_Store
data
.vscode


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) [2019] [textpack]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

================================================
FILE: Pipfile
================================================
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
pep8 = "*"
pylint = "*"

[packages]
pandas = "*"
scikit-learn = "*"
scipy = "*"
numpy = "*"
cython = "*"
sparse-dot-topn = "*"
kaggle = "*"

[requires]
python_version = "3.7"


================================================
FILE: README.md
================================================
## What is this?

TextPack efficiently groups similar values in large (or small) datasets. Under the hood, it builds a document term matrix of n-grams assigned a TF-IDF score. It then uses matrix multiplication to calculate the cosine similarity between these values. For a technical explanation, [I wrote a blog post](https://medium.com/p/2493b3ce6d8d).

## Why do I care?

If you're an analyst, journalist, data scientist or similar and have ever had a spreadsheet, SQL table or JSON string filled with inconsistent inputs like this:

| row |     fullname      |
|-----|-------------------|
|   1 | John F. Doe       |
|   2 | Esquivel, Mara    |
|   3 | Doe, John F       |
|   4 | Whyte, Luke       |
|   5 | Doe, John Francis |

And you want to perform some kind of analysis – perhaps in a Pivot Table or a Group By statement – but are hindered by the deviations in spelling and formatting, you can use TextPack to comb thousands of cells in seconds and create a third column like this:

| row |     fullname      |  name_groups  |
|-----|-------------------|---------------|
|   1 | John F. Doe       | Doe John F    |
|   2 | Esquivel, Mara    | Esquivel Mara |
|   3 | Doe, John F       | Doe John F    |
|   4 | Whyte, Luke       | Whyte Luke    |
|   5 | Doe, John Francis | Doe John F    |

We can then group by `name_groups` and perform our analysis. 

You can also group across multiple columns. For instance, given the following:

| row |  make  |   model   |
|-----|--------|-----------|
|   1 | Toyota | Camry     |
|   2 | toyta  | camry DXV |
|   3 | Ford   | F-150     |
|   4 | Toyota | Tundra    |
|   5 | Honda  | Accord    |

You can group across `make` and `model` to create:

| row |  make  |   model   |  car_groups  |
|-----|--------|-----------|--------------|
|   1 | Toyota | Camry     | toyotacamry  |
|   2 | toyta  | camry DXV | toyotacamry  |
|   3 | Ford   | F-150     | fordf150     |
|   4 | Toyota | Tundra    | toyotatundra |
|   5 | Honda  | Accord    | hondaaccord  |

## How do I use it?

#### Installation

```
pip install textpack
```

#### Import module

```
from textpack import tp
```

#### Instantiate TextPack

```
tp.TextPack(df, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)
```

Class parameters:

 - `df` (required): A pandas DataFrame containing the dataset to group
 - `columns_to_group` (required): A list or string matching the column header(s) you'd like to parse and group
 - `match_threshold` (optional): This is a floating point number between 0 and 1 that represents the cosine similarity threshold we'll use to determine if two strings should be grouped. The closer the threshold to 1, the higher the similarity will need to be to be considered a match.
 - `ngram_remove` (optional): A regular expression you can use to filter characters out of your strings when we build our n-grams.
 - `ngram_length` (optional): The length of our n-grams. This can be used in tandem with `match_threshold` to find the sweet spot for grouping your dataset. If TextPack is running slow, it's usually a sign to consider raising the n-gram length.
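
To make `ngram_remove` and `ngram_length` concrete, here is a minimal standalone sketch of how a string might be split into n-grams. It mirrors the filtering-then-windowing idea used in `tp.py`, but the function name `ngrams` is hypothetical and is not part of TextPack's public API.

```python
import re


def ngrams(string, ngram_remove=r'[,-./]', ngram_length=3):
    """Hypothetical helper: strip filtered characters, then slide a window over the rest."""
    # Remove every character matched by the regex (commas, hyphens, dots and slashes by default).
    cleaned = re.sub(ngram_remove, '', string)
    # Each window of ngram_length consecutive characters becomes one n-gram.
    return [cleaned[i:i + ngram_length] for i in range(len(cleaned) - ngram_length + 1)]


print(ngrams('Doe, John F'))                  # overlapping 3-character windows
print(ngrams('Doe, John F', ngram_length=5))  # fewer, longer windows
```

A longer `ngram_length` yields fewer n-grams per string, which is one plausible reason raising it can speed things up; strings shorter than `ngram_length` produce no n-grams at all.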

TextPack can also be instantiated using the following helpers, each of which is just a wrapper that converts a data format to a pandas DataFrame and then passes it to TextPack. Thus, they all require a file path and `columns_to_group`, and take the same three optional parameters as calling `TextPack` directly. `read_excel` additionally accepts an optional `sheet_name`.

```
tp.read_csv(csv_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)
```

```
tp.read_excel(excel_path, columns_to_group, sheet_name=None, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)
```

```
tp.read_json(json_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)
```

#### Run TextPack and group values

TextPack objects have the following public properties:

 - `df`: The dataframe used internally by TextPack – manipulate as you see fit
 - `group_lookup`: A Python dictionary built by `build_group_lookup` and then used by `add_grouped_column_to_data` to lookup each value that has a group. It looks like this: 

```
{ 
    'John F. Doe': 'Doe John F',
    'Doe, John F': 'Doe John F',
    'Doe, John Francis': 'Doe John F'
}
```

TextPack objects also have the following public methods:

 - `build_group_lookup()`: Runs the cosine similarity analysis and builds `group_lookup`.
 - `add_grouped_column_to_data(column_name='Group')`: Uses vectorization to map values to groups via `group_lookup` and add the new column to the DataFrame. The column header can be set via `column_name`.
 - `set_match_threshold(match_threshold)`: Modify the match threshold internally.
 - `set_ngram_remove(ngram_remove)`: Modify the n-gram regex filter internally.
 - `set_ngram_length(ngram_length)`: Modify the n-gram length internally.
 - `run(column_name='Group')`: A helper function that calls `build_group_lookup` followed by `add_grouped_column_to_data`.
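
As a rough illustration of what `build_group_lookup` and `add_grouped_column_to_data` do, the sketch below folds a list of similar pairs into a lookup dictionary and then maps values through it, falling back to the original value when no group exists. It uses plain Python lists instead of a DataFrame, the pairs are hand-written assumptions rather than real cosine-similarity output, and the function names are hypothetical.

```python
def build_lookup_from_pairs(pairs):
    """Hypothetical helper: fold (a, b) pairs judged similar into a value -> group dict."""
    lookup = {}
    for a, b in pairs:
        # Reuse the group of whichever value is already known; otherwise `a` names a new group.
        group = lookup.get(a, lookup.get(b, a))
        lookup[a] = group
        lookup[b] = group
    return lookup


def apply_lookup(values, lookup):
    """Map each value to its group, keeping ungrouped values unchanged."""
    return [lookup.get(value, value) for value in values]


# Assumed input: pairs whose similarity cleared the match threshold.
similar_pairs = [
    ('John F. Doe', 'Doe, John F'),
    ('Doe, John F', 'Doe, John Francis'),
]
names = ['John F. Doe', 'Esquivel, Mara', 'Doe, John F', 'Whyte, Luke', 'Doe, John Francis']

lookup = build_lookup_from_pairs(similar_pairs)
print(apply_lookup(names, lookup))
```

In this sketch the group label is simply the first member encountered, so the three "Doe" variants all map to `'John F. Doe'` while the unmatched names pass through untouched.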

#### Export our grouped dataset

 - `export_json(export_path)`
 - `export_csv(export_path)`

#### A simple example

```
from textpack import tp

cars = tp.read_csv('./cars.csv', ['make', 'model'], match_threshold=0.8, ngram_length=5)

cars.run()

cars.export_csv('./cars-grouped.csv')
```

## Troubleshooting

#### I'm getting a Memory Error!

Some users have triggered memory errors when parsing big data sets. [This StackOverflow post](https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type) has proved useful.

## How does it work?

As mentioned above, under the hood, we're building a document term matrix of n-grams assigned a TF-IDF score. We're then using matrix multiplication to quickly calculate the cosine similarity between these values.

I wrote [this blog post](https://medium.com/p/2493b3ce6d8d) to explain how TextPack works behind the scenes. Check it out!
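
For readers who want to see the idea without any dependencies, here is a small pure-Python sketch of n-gram TF-IDF vectors and cosine similarity. It is not TextPack's implementation: TextPack relies on scikit-learn and sparse matrix multiplication, whereas this sketch uses dictionaries and computes one pair at a time. The smoothed IDF formula and all function names are assumptions of the sketch.

```python
import math
from collections import Counter


def ngrams(s, n=3):
    # Overlapping character windows of length n.
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf_vectors(values, n=3):
    """Return one unit-length {ngram: weight} dict per input string."""
    counts = [Counter(ngrams(v, n)) for v in values]
    # Document frequency: in how many strings does each n-gram appear?
    doc_freq = Counter(g for c in counts for g in c)
    total = len(values)
    vectors = []
    for c in counts:
        # Assumed smoothed IDF: n-grams that appear in fewer strings weigh more.
        weights = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
                   for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({g: w / norm for g, w in weights.items()})
    return vectors


def cosine(u, v):
    # For unit-length vectors, the dot product equals the cosine similarity.
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ['Doe John F', 'Doe John Francis', 'Esquivel Mara']
vecs = tfidf_vectors(names)
print(round(cosine(vecs[0], vecs[1]), 2))  # the two "Doe" variants share many n-grams
print(round(cosine(vecs[0], vecs[2]), 2))  # no shared n-grams
```

Multiplying the full TF-IDF matrix by its transpose produces every one of these pairwise dot products at once, which is why the matrix formulation scales to large datasets.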


================================================
FILE: run_tests.sh
================================================
if [ ! -f textpack/tests/craigslistVehicles.csv ]; then # state files
  kaggle datasets download -d austinreese/craigslist-carstrucks-data -p textpack/tests
  unzip -o textpack/tests/craigslist-carstrucks-data.zip -d textpack/tests
  chmod 644 textpack/tests/craigslistVehicles.csv 
  rm -rf textpack/tests/craigslistVehiclesFull.csv textpack/tests/craigslist-carstrucks-data.zip
fi

python -m unittest discover -s textpack/tests

# Delete the craigslistVehicles file (1.1GB)
rm -rf textpack/tests/craigslistVehicles.csv


================================================
FILE: setup.py
================================================
import setuptools

with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(
    name="textpack",
    version="0.0.5",
    author="Luke Whyte",
    author_email="lukeawhyte@gmail.com",
    description="Quickly identify and group similar text strings in a large dataset",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/lukewhyte/textpack",
    packages=setuptools.find_packages(),
    install_requires=[
        'pandas',
        'scikit-learn',
        'scipy',
        'numpy',
        'cython',
        'sparse-dot-topn'
    ],
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
)


================================================
FILE: textpack/__init__.py
================================================
name = "textpack"

================================================
FILE: textpack/tests/test_tp.py
================================================
import unittest
import json

from textpack import tp


class TestTextPack(unittest.TestCase):
    def test_build_group_index(self):
        cl_vehicles = tp.read_csv('./textpack/tests/craigslistVehicles.csv', ['manufacturer', 'make'])
        cl_vehicles.build_group_lookup()

        expected = 'toyotacamry'
        result = cl_vehicles.group_lookup['toyotacamry dx v6']

        self.assertEqual(expected, result)

    def test_add_grouped_column_to_data(self):
        cl_vehicles = tp.read_csv('./textpack/tests/craigslistVehicles.csv', ['manufacturer', 'make'])

        cl_vehicles.build_group_lookup()
        cl_vehicles.add_grouped_column_to_data()

        expected = 443405
        result = cl_vehicles.df['textpackGrouper'].shape[0]

        self.assertEqual(expected, result)

    def test_export_json(self):
        cl_vehicles = tp.read_csv('./textpack/tests/craigslistVehicles.csv', ['manufacturer', 'make'])

        cl_vehicles.run()

        expected = 'https://youngstown.craigslist.org/ctd/d/youngstown-2018-kia-forte-lx-sedan-4d/6886405952.html'

        print('Loading JSON, hold fast...')
        result = json.loads(cl_vehicles.export_json())['url']['439513']

        self.assertEqual(expected, result)


================================================
FILE: textpack/tp.py
================================================
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sparse_dot_topn import awesome_cossim_topn


class TextPack():
    def __init__(self, df, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3):
        self.df = df
        self.group_lookup = {}
        self._column = self._get_column(columns_to_group)
        self._match_threshold = match_threshold
        self._ngram_remove = ngram_remove
        self._ngram_length = ngram_length

    def _get_column(self, columns_to_group):
        if ''.join(columns_to_group) in self.df.columns:
            return ''.join(columns_to_group)
        else:  # index rather than pop(0), so the caller's list is left unmodified
            self.df['textpackGrouper'] = self.df[columns_to_group[0]].astype(str).str.cat(self.df[columns_to_group[1:]].astype(str))
            return 'textpackGrouper'

    def _ngrams_analyzer(self, string):
        string = re.sub(self._ngram_remove, r'', string)
        ngrams = zip(*[string[i:] for i in range(self._ngram_length)])
        return [''.join(ngram) for ngram in ngrams]

    def _get_tf_idf_matrix(self, vals):
        vectorizer = TfidfVectorizer(analyzer=self._ngrams_analyzer)
        return vectorizer.fit_transform(vals)

    def _get_cosine_matrix(self, vals):
        tf_idf_matrix = self._get_tf_idf_matrix(vals)
        return awesome_cossim_topn(tf_idf_matrix, tf_idf_matrix.transpose(), vals.size, self._match_threshold)

    def _find_group(self, y, x):
        if y in self.group_lookup:
            return self.group_lookup[y]
        elif x in self.group_lookup:
            return self.group_lookup[x]
        else:
            return None

    def _add_vals_to_lookup(self, group, y, x):
        self.group_lookup[y] = group
        self.group_lookup[x] = group

    def _add_pair_to_lookup(self, row, col):
        group = self._find_group(row, col)
        if group is not None:
            self._add_vals_to_lookup(group, row, col)
        else:
            self._add_vals_to_lookup(row, row, col)

    def set_ngram_remove(self, ngram_remove):
        self._ngram_remove = ngram_remove

    def set_ngram_length(self, ngram_length):
        self._ngram_length = ngram_length

    def set_match_threshold(self, match_threshold):
        self._match_threshold = match_threshold

    def build_group_lookup(self):
        vals = self.df[self._column].unique().astype('U')

        print('Building the TF-IDF, Cosine & Coord matrices...')
        coord_matrix = self._get_cosine_matrix(vals).tocoo()

        print('Building the group lookup...')
        for row, col in zip(coord_matrix.row, coord_matrix.col):
            if row != col:
                self._add_pair_to_lookup(vals[row], vals[col])

    def add_grouped_column_to_data(self, column_name='Group'):
        print('Adding grouped columns to data frame...')
        self.df[column_name] = self.df[self._column].map(self.group_lookup).fillna(self.df[self._column])

    def run(self, column_name='Group'):
        self.build_group_lookup()
        self.add_grouped_column_to_data(column_name)
        print('Ready for export')

    def _filter_df_for_export(self):
        return self.df.drop(columns=['textpackGrouper']) if 'textpackGrouper' in self.df.columns else self.df

    def export_json(self, export_path=None):
        return self._filter_df_for_export().to_json(export_path)

    def export_csv(self, export_path=None):
        return self._filter_df_for_export().to_csv(export_path)


def read_json(json_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3):
    return TextPack(pd.read_json(json_path), columns_to_group, match_threshold, ngram_remove, ngram_length)


def read_excel(excel_path, columns_to_group, sheet_name=None, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3):
    return TextPack(pd.read_excel(excel_path, sheet_name=0 if sheet_name is None else sheet_name), columns_to_group, match_threshold, ngram_remove, ngram_length)  # sheet_name goes to pandas; None falls back to the first sheet


def read_csv(csv_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3):
    return TextPack(pd.read_csv(csv_path), columns_to_group, match_threshold, ngram_remove, ngram_length)
SYMBOL INDEX (25 symbols across 2 files)

FILE: textpack/tests/test_tp.py
  class TestTextPack (line 7) | class TestTextPack(unittest.TestCase):
    method test_build_group_index (line 8) | def test_build_group_index(self):
    method test_add_grouped_column_to_data (line 17) | def test_add_grouped_column_to_data(self):
    method test_export_json (line 28) | def test_export_json(self):

FILE: textpack/tp.py
  class TextPack (line 8) | class TextPack():
    method __init__ (line 9) | def __init__(self, df, columns_to_group, match_threshold=0.75, ngram_r...
    method _get_column (line 17) | def _get_column(self, columns_to_group):
    method _ngrams_analyzer (line 24) | def _ngrams_analyzer(self, string):
    method _get_tf_idf_matrix (line 29) | def _get_tf_idf_matrix(self, vals):
    method _get_cosine_matrix (line 33) | def _get_cosine_matrix(self, vals):
    method _find_group (line 37) | def _find_group(self, y, x):
    method _add_vals_to_lookup (line 45) | def _add_vals_to_lookup(self, group, y, x):
    method _add_pair_to_lookup (line 49) | def _add_pair_to_lookup(self, row, col):
    method set_ngram_remove (line 56) | def set_ngram_remove(self, ngram_remove):
    method set_ngram_length (line 59) | def set_ngram_length(self, ngram_length):
    method set_match_threshold (line 62) | def set_match_threshold(self, match_threshold):
    method build_group_lookup (line 65) | def build_group_lookup(self):
    method add_grouped_column_to_data (line 76) | def add_grouped_column_to_data(self, column_name='Group'):
    method run (line 80) | def run(self, column_name='Group'):
    method _filter_df_for_export (line 85) | def _filter_df_for_export(self):
    method export_json (line 88) | def export_json(self, export_path=None):
    method export_csv (line 91) | def export_csv(self, export_path=None):
  function read_json (line 95) | def read_json(json_path, columns_to_group, match_threshold=0.75, ngram_r...
  function read_excel (line 99) | def read_excel(excel_path, columns_to_group, sheet_name=None, match_thre...
  function read_csv (line 103) | def read_csv(csv_path, columns_to_group, match_threshold=0.75, ngram_rem...