Repository: etano/productner
Branch: master
Commit: 8c511964be69
Files: 16
Total size: 44.1 KB
Directory structure:
gitextract_hvlacfzd/
├── .gitignore
├── Pipfile
├── README.md
├── classifier.py
├── data/
│ ├── groups.py
│ ├── normalize.py
│ ├── parse.py
│ ├── supplement.py
│ ├── tag.py
│ └── trim.py
├── extract.py
├── ner.py
├── tokenizer.py
├── train_classifier.py
├── train_ner.py
└── train_tokenizer.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
*.swp
*.pyc
*.swo
*.swn
*.txt
*.csv
*.json
*.h5
.idea/*
*.zip
*.gz
.DS_Store
/models
================================================
FILE: Pipfile
================================================
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"
[dev-packages]
[packages]
keras = "*"
scikit-learn = "*"
tensorflow = "*"
"h5py" = "*"
================================================
FILE: README.md
================================================
# Product categorization and named entity recognition
This repository is meant to automatically extract features from product titles and descriptions. Below we explain how to install and run the code and describe the implemented algorithms. We also provide background information, including the current state of the art in both sequence classification and sequence tagging, and suggest possible improvements to the current implementation. Enjoy!
## Requirements
Use Python 3.7 and install the dependencies via the following command (please use venv or conda):
```
pip install -r requirements.txt
```
## Usage
### Fetching data
#### Amazon product data
```
cd ./data/
wget http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz
gzip -d metadata.json.gz
```
#### GloVe
```
cd ./data/
wget https://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
```
### Preprocessing data
```
cd ./data/
python parse.py metadata.json
python normalize.py products.csv
python trim.py products.normalized.csv
python supplement.py products.normalized.trimmed.csv
python tag.py products.normalized.trimmed.supplemented.csv
```
### Training models
```
mkdir -p ./models/
python train_tokenizer.py data/products.normalized.trimmed.supplemented.tagged.csv
python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv
python train_ner.py data/products.normalized.trimmed.supplemented.tagged.csv
```
### Extract information
Run inference on the sample dataset with your trained models:
```
python extract.py ./models/ Product\ Dataset.csv
```
## Contents
- extract.py: Script to extract product category specific attributes based on product titles and descriptions
- train_tokenizer.py: Script to train a word tokenizer
- train_ner.py: Script to train a product named entity recognizer based on product titles
- train_classifier.py: Script to train a product category classifier based on product titles and descriptions
- tokenizer.py: Word tokenizer class
- ner.py: Named entity recognition class
- classifier.py: Product classifier class
- data/parse.py: Parses Amazon product metadata found at http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz
- data/normalize.py: Normalizes product data
- data/trim.py: Trims product data
- data/supplement.py: Supplements product data
- data/tag.py: Tags product data
- Product\ Dataset.csv: CSV file with product ids, names, and descriptions
## Algorithms
These are the methods used in this demonstrative implementation. For state-of-the-art extensions, we refer the reader to the references listed below.
- Tokenization: Built-in Keras tokenizer with 80,000 word maximum
- Embedding: Stanford GloVe (Wikipedia 2014 + Gigaword 5, 200 dimensions) with 200 sequence length maximum
- Sequence classification: 3 layer CNN with max pooling between the layers
- Sequence tagging: Bidirectional LSTM
For the sequence classification task, we extract product titles, descriptions, and categories from the Amazon product corpus. We then fit our CNN model to predict product category based on a combination of product title and description. On 800K samples with a batch size of 256, we achieve an overall f1 score of ~0.90 after 2 epochs.
For the sequence tagging task, we extract product titles and brands from the Amazon product corpus. We then fit our bidirectional LSTM model to label each word token in the product title as either part of a brand or not. On 800K samples with a batch size of 256, we achieve an overall f1 score of ~0.85 after 2 epochs.
For both models we use the GloVe embedding with 200 dimensions, though we note that a higher-dimensional embedding might achieve superior performance. Additionally, we could be more careful in the data preprocessing and trim bad tokens (e.g. HTML remnants). For both models we also use a dropout layer after the embedding to combat overfitting the data.
## Background
### Problem definition
The problem of extracting features from unstructured textual data can be given different names depending on the circumstances and desired outcome. Generally, we can split tasks into two camps: sequence classification and sequence tagging.
In sequence classification, we take a text fragment (usually a sentence up to an entire document), and try to project it into a categorical space. This is considered a many-to-one classification in that we are taking a set of many features and producing a single output.
Sequence tagging, on the other hand, is often considered a many-to-many problem since you take in an entire sequence and attempt to apply a label to each element of the sequence. An example of sequence tagging is part of speech labeling, where one attempts to label the part of speech of each word in a sentence. Other methods that fall into this camp include chunking (breaking a sentence into relational components) and named entity recognition (extracting pre-specified features like geographic locations or proper names).
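To make the tagging format concrete, here is a hypothetical product title labelled with the BIO-style scheme this repository uses for brands in data/tag.py ('B-B' marks the beginning of a brand, 'I-B' its continuation, 'O' everything else); the title and labels are purely illustrative:
```python
# Hypothetical example of BIO-style brand tagging as produced by data/tag.py.
title = "new balance 990 running shoes"
tags = ["B-B", "I-B", "O", "O", "O"]  # the brand here is "new balance"
for word, tag in zip(title.split(" "), tags):
    print(word, tag)
```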
### Tokenization and embedding
An often important step in any natural language processing task is projecting from the character-based space that composes words and sentences to a numeric space on which computer models can operate.
The first step is simply to index the unique tokens appearing in a dataset. There is some freedom in what is considered a token: it can be a specific group of words, a single word, or even individual characters. A popular choice is to simply create a word-based dictionary which maps unique space-separated character sequences to unique indices. Usually this is done after a normalization procedure where everything is lower-cased, converted to ASCII, etc. This dictionary can then be sorted by frequency of occurrence in the dataset and truncated to a maximum size. After tokenization, your dataset is transformed into a set of indices, where truncated words are typically replaced with a '0' index. The sketch below shows this indexing and truncation with the Keras tokenizer that this repository wraps in tokenizer.py.
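A minimal sketch, assuming the Keras 2 `num_words`/`pad_sequences` API; the texts and size limits are illustrative:
```python
# Index words with the Keras tokenizer, then pad/truncate to a fixed length.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ["nike running shoes", "adidas running shorts"]
tokenizer = Tokenizer(num_words=80000)  # keep at most the 80,000 most frequent words
tokenizer.fit_on_texts(texts)           # build the word -> index dictionary
sequences = tokenizer.texts_to_sequences(texts)
data = pad_sequences(sequences, maxlen=200)  # padding and out-of-range words use index 0
print(data.shape)  # (2, 200)
```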
Following tokenization, the indexed words are often projected into an embedding vector space. Currently popular embeddings include word2vec [\[1\]](#references) and GloVe [\[2\]](#references). Word2vec (as the name implies) is a word-to-vector-space projector composed of a two-layer neural network. The network is trained in one of two ways: continuous bag-of-words, where the model attempts to predict the current word by using the surrounding words as context features, or continuous skip-grams, where the model attempts to predict the surrounding context words by looking at the current word. GloVe stands for "global vectors for word representation". Essentially, it is a count-based, unsupervised learned embedding where a token co-occurrence matrix is constructed and factored.
Vector-space embedding methods generally provide substantial improvements over using basic dictionaries since they inject contextual knowledge from the language. Additionally, they allow a much more compact representation, while maintaining important correlations. For example, they allow you to do amazing things like performing word arithmetic:
```
king - man + woman = queen
```
where equality is determined by directly computing vector overlaps.
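As a rough illustration, this analogy can be checked directly against the GloVe file downloaded in the Usage section; the path and the brute-force ranking below are assumptions for illustration (it is slow over the full vocabulary, and real toolkits do this far more efficiently):
```python
# Approximate the 'king - man + woman' analogy with raw GloVe vectors.
# Assumes glove.6B.200d.txt was unzipped into ./data/ as in the Usage section.
import numpy as np

embeddings = {}
with open("data/glove.6B.200d.txt", "r") as f:
    for line in f:
        values = line.split()
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Brute-force ranking by cosine similarity; 'queen' should appear near the top.
best = sorted(embeddings, key=lambda w: cosine(embeddings[w], target), reverse=True)
print(best[:5])
```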
### Sequence classification
Classification is the shining pillar of modern-day machine learning, with convolutional neural networks (CNNs) at the top. With their ability to efficiently represent high-level features via windowed filtering, CNNs have seen their largest success in the classification and segmentation of images. More recently, however, CNNs have started seeing success in natural language sequence classification as well. Several recent works have shown that for text classification, CNNs can significantly outperform other classifying methods such as hidden Markov models and support vector machines [\[3,4\]](#references). The reason CNNs see success in text classification is likely the same reason they see success in the vision domain: there are strong, regular correlations between nearby features which are efficiently picked up by reasonably sized filters.
Even more recently, the dominance of CNNs has been toppled by recurrent neural network (RNN) architectures. In particular, long short-term memory (LSTM) units have shown exceptional promise. LSTMs pass output from one unit to the next, while carrying along an internal state. How this state updates (as well as the other weights in the network) can be trained end-to-end on variable-length sequences by passing a single token at a time. For classification, bidirectional LSTMs, which allow for long-range contextual correlations in both forward and reverse directions, have seen the best performance [\[5,6\]](#references). An additional feature of these networks is an attention layer that allows continuous addressing of the internal states of the sequential LSTM units. This further strengthens the network's ability to draw correlations between both nearby and far-away tokens.
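For contrast with the CNN classifier in classifier.py, a minimal Keras sketch of a bidirectional-LSTM sequence classifier might look like the following; the layer sizes are illustrative assumptions, not values from this repository, and the attention layer discussed above is omitted:
```python
# Minimal bidirectional-LSTM sequence classifier (illustrative sizes, no attention).
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size, embedding_dim, max_len, n_classes = 80000, 200, 200, 30
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_len))
model.add(Bidirectional(LSTM(64)))  # reads the sequence forwards and backwards
model.add(Dense(n_classes, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["acc"])
```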
### Sequence tagging
As mentioned above, sequence tagging is a many-to-many machine learning task, which places added emphasis on the sequential nature of the input and output. This makes CNNs largely ill-suited for the problem. Instead, the dominant approaches are again bidirectional LSTMs [\[11,12\]](#references), as well as another method called conditional random fields (CRFs) [\[7\]](#references). CRFs can be seen either as sequential logistic regression or as more powerful hidden Markov models. Essentially, they are sequential models composed of many defined feature functions that depend both on the word currently being labelled and on the surrounding words. The relative weights of these feature functions can then be trained via any supervised learning approach. CRFs are used extensively in the literature for both part-of-speech tagging and named entity recognition because of their ease of use and intuitive structure [\[8-10\]](#references).
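To make the feature-function idea concrete, here is a hand-written sketch of the kind of per-word features a CRF tagger scores; the feature names and the one-word context window are illustrative assumptions, and a real CRF learns a weight for each such feature during training:
```python
# Illustrative CRF-style feature functions for one word in a sequence.
def word_features(words, i):
    word = words[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalization often signals a name
        "word.isdigit": word.isdigit(),
        "prev.lower": words[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": words[i + 1].lower() if i + 1 < len(words) else "<EOS>",
    }

words = "New Balance 990 running shoes".split()
print(word_features(words, 1))  # features for 'Balance'
```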
Even more recent models for sequence tagging use a combination of the aforementioned methods (CNN, LSTM, and CRF) [\[13,14,15\]](#references). These works usually use a bidirectional LSTM as the main labeling architecture, another RNN or CNN to capture character-level information, and finally a CRF layer to model the label dependencies. A logical next step would be to combine these methods with the neural attention models used in sequence classification, though this seems to be missing from the current literature.
### Future directions
Looking forward, there are several available avenues for continued research. More sophisticated word embeddings might help alleviate the need for complicated neural architectures. Hierarchical optimization methods can be used to automatically build new architectures as well as optimize hyperparameters. Diverse models can be intelligently combined to produce more powerful classification schemes (indeed, almost all Kaggle competitions are won this way). One interesting approach is to combine text data with other available data sources such as associated images [\[10\]](#references). By collecting data from different sources, feature labels could possibly be extracted automatically by cross-comparison.
## References
[1] "Distributed Representations of Words and Phrases and their Compositionality". Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. https://arxiv.org/abs/1310.4546.
[2] "GloVe: Global Vector Representation for Words". Stanford NLP. 2015. https://nlp.stanford.edu/projects/glove/.
[3] "Convolutional Neural Networks for Sentence Classification". Yoon Kim. 2014. "https://arxiv.org/abs/1408.5882.
[4] "Character-level Convolutional Networks for Text Classification". Xiang Zhang, Junbo Zhao, Yann LeCun. 2015. https://arxiv.org/abs/1509.01626.
[5] "Document Modeling with Gated Recurrent Neural Network for Sentiment Classification". Duyu Tang, Bing Qin, Ting Liu. 2015. http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP167.pdf.
[6] "Hierarchical Attention Networks for Document Classification". Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy. 2016. https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf.
[7] "An Introduction to Conditional Random Fields". Charles Sutton, Andrew McCallum. 2010. https://arxiv.org/abs/1011.4088.
[8] "Attribute Extraction from Product Titles in eCommerce". Ajinkya More. 2016. https://arxiv.org/abs/1608.04670.
[9] "Bootstrapped Named Entity Recognition for Product Attribute Extraction". Duangmanee (Pew) Putthividhya, Junling Hu. 2011. http://www.aclweb.org/anthology/D11-1144.
[10] "A Machine Learning Approach for Product Matching and Categorization". Petar Ristoski, Petar Petrovski, Peter Mika, Heiko Paulheim. 2017. http://www.semantic-web-journal.net/content/machine-learning-approach-product-matching-and-categorization-0.
[11] "Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss". Barbara Plank, Anders Søgaard, Yoav Goldberg. 2016. https://arxiv.org/abs/1604.05529.
[12] "Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network". Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao. 2015. https://arxiv.org/abs/1510.06168.
[13] "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF". Xuezhe Ma, Eduard Hovy. 2016. https://arxiv.org/abs/1603.01354.
[14] "Neural Architectures for Named Entity Recognition". Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. 2016. https://arxiv.org/abs/1603.01360.
[15] "Neural Models for Sequence Chunking". Feifei Zhai, Saloni Potdar, Bing Xiang, Bowen Zhou. 2017. https://arxiv.org/abs/1701.04027.
================================================
FILE: classifier.py
================================================
"""Product classifier class"""
import json
import os
import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Input, Flatten, Dropout, Conv1D, MaxPooling1D, Embedding
from keras.models import load_model, Model
from keras.utils.np_utils import to_categorical
from sklearn.metrics import classification_report
class ProductClassifier(object):
"""Class which classifies products based on various inputs
Attributes:
prefix (str): Model files prefix
model (keras.model): Keras model
category_map (dict(str, int)): Map between category names and indices
"""
def __init__(self, prefix=None):
"""Load in model and category map
Args:
prefix (str): Prefix of directory containing model HDF5 file and category map JSON file
"""
        if prefix is not None:
self.load(prefix)
else:
self.prefix = 'models/classifier'
self.model = None
self.category_map = {}
def load(self, prefix=None):
"""Load in model and category map
Args:
prefix (str): Prefix of directory containing model HDF5 file and category map JSON file
"""
        if prefix is not None: self.prefix = prefix
self.model = load_model(self.prefix + '.h5')
self.category_map = json.load(open(self.prefix + '.json', 'r'))
def save(self, prefix=None):
"""Save in model and category map
Args:
prefix (str): Prefix of directory containing model HDF5 file and category map JSON file
"""
        if prefix is not None: self.prefix = prefix
self.model.save(self.prefix + '.h5')
with open(self.prefix + '.json', 'w') as out:
json.dump(self.category_map, out)
def index_categories(self, categories):
"""Take a list of possibly duplicate categories and create an index list
Args:
categories (list(str)): List of categories
Returns:
list(int): List of indices
"""
print('Indexing categories...')
indices = []
for category in categories:
if not (category in self.category_map):
self.category_map[category] = len(self.category_map)
indices.append(self.category_map[category])
print(('Found %s unique categories.' % len(self.category_map)))
return indices
def classify(self, data):
"""Classify by products by text
Args:
data (np.array): 2D array representing descriptions of the product and/or product title
Returns:
list(dict(str, float)): List of dictionaries of product categories with associated confidence
"""
prediction = self.model.predict(data)
all_category_probs = []
for i in range(prediction.shape[0]):
category_probs = {}
for category in self.category_map:
category_probs[category] = prediction[i, self.category_map[category]]
all_category_probs.append(category_probs)
return all_category_probs
def get_labels(self, categories):
"""Create labels from a list of categories
Args:
categories (list(str)): A list of product categories
Returns:
(list(int)): List of indices
"""
indexed_categories = self.index_categories(categories)
labels = to_categorical(np.asarray(indexed_categories))
return labels
    def compile(self, tokenizer, glove_dir='./data/', embedding_dim=100, dropout_fraction=0.0, kernel_size=5,
                n_filters=128):
        """Compile network model for classifier
        Args:
            tokenizer (WordTokenizer): Object used to tokenize original texts
            glove_dir (str): Directory containing the GloVe embedding files
            embedding_dim (int): Size of embedding vector
            dropout_fraction (float): Fraction of randomly zeroed weights in dropout layer
            kernel_size (int): Size of sliding window for convolution
            n_filters (int): Number of filters to produce from convolution
"""
# Load embedding layer
print('Loading GloVe embedding...')
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.' + str(embedding_dim) + 'd.txt'), 'r')
for line in f:
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
f.close()
print(('Found %s word vectors.' % len(embeddings_index)))
# Create embedding layer
print('Creating embedding layer...')
embedding_matrix = np.zeros((len(tokenizer.tokenizer.word_index) + 1, embedding_dim))
for word, i in list(tokenizer.tokenizer.word_index.items()):
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
# words not found in embedding index will be all-zeros.
embedding_matrix[i] = embedding_vector
embedding_layer = Embedding(len(tokenizer.tokenizer.word_index) + 1,
embedding_dim,
weights=[embedding_matrix],
input_length=tokenizer.max_sequence_length,
trainable=False)
# Create network
print('Creating network...')
sequence_input = Input(shape=(tokenizer.max_sequence_length,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Dropout(dropout_fraction)(embedded_sequences)
        x = Conv1D(n_filters, kernel_size, activation='relu')(x)
        x = MaxPooling1D(kernel_size)(x)
        x = Conv1D(n_filters, kernel_size, activation='relu')(x)
        x = MaxPooling1D(kernel_size)(x)
        x = Conv1D(n_filters, kernel_size, activation='relu')(x)
x = MaxPooling1D(int(x.shape[1]))(x) # global max pooling
x = Flatten()(x)
x = Dense(n_filters, activation='relu')(x)
preds = Dense(len(self.category_map), activation='softmax')(x)
# Compile model
print('Compiling network...')
self.model = Model(sequence_input, preds)
self.model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
def train(self, data, labels, validation_split=0.2, batch_size=256, epochs=2):
"""Train classifier
Args:
            data (np.array): 2D numpy array (n_samples, tokenizer.max_sequence_length)
labels (np.array): 2D numpy array (n_samples, len(self.category_map))
validation_split (float): Fraction of samples to be used for validation
batch_size (int): Training batch size
epochs (int): Number of training epochs
"""
print('Training...')
# Split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(validation_split * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
# Train!
self.save()
checkpointer = ModelCheckpoint(filepath=self.prefix + '.h5', verbose=1, save_best_only=False)
        self.model.fit(x_train, y_train, validation_data=(x_val, y_val),
                       callbacks=[checkpointer],
                       epochs=epochs, batch_size=batch_size)
self.evaluate(x_val, y_val, batch_size)
def evaluate(self, x_test, y_test, batch_size=256):
"""Evaluate classifier
Args:
            x_test (np.array): 2D numpy array (n_samples, tokenizer.max_sequence_length)
y_test (np.array): 2D numpy array (n_samples, len(self.category_map))
batch_size (int): Training batch size
"""
print('Evaluating...')
predictions_last_epoch = self.model.predict(x_test, batch_size=batch_size, verbose=1)
predicted_classes = np.argmax(predictions_last_epoch, axis=1)
target_names = [''] * len(self.category_map)
for category in self.category_map:
target_names[self.category_map[category]] = category
y_val = np.argmax(y_test, axis=1)
print((classification_report(y_val, predicted_classes, target_names=target_names, digits=6,
labels=range(len(self.category_map)))))
================================================
FILE: data/groups.py
================================================
"""Counts brand and top-level category frequencies in a product CSV"""
import csv
import sys
from operator import itemgetter
with open(sys.argv[1], 'r') as f:
reader = csv.reader(f)
brands, categories = {}, {}
count = 0
for row in reader:
count += 1
if not (count % 10000): print(count)
brand = row[1]
if brand in brands:
brands[brand] += 1
else:
brands[brand] = 1
category = row[3].split(' / ')[0]
if category in categories:
categories[category] += 1
else:
categories[category] = 1
print(sorted(brands.items(), key=itemgetter(1)))
print(sorted(categories.items(), key=itemgetter(1)))
================================================
FILE: data/normalize.py
================================================
"""Normalizes product data"""
import csv
import sys
def unescape(s):
if sys.version_info >= (3, 0):
import html
output = html.unescape(str(s))
else:
import htmllib
p = htmllib.HTMLParser(None)
p.save_bgn()
try:
p.feed(s)
except:
return s
output = p.save_end()
return output
in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['normalized'] + ['csv'])
with open(in_file, 'r') as f:
reader = csv.reader(f)
writer = csv.writer(open(out_file, "w"))
count = 0
for row in reader:
count += 1
if not (count % 10000):
print(count, 'rows normalized')
row = [unescape(x).lower().replace('\\n', ' ') for x in row]
writer.writerow(row)
print(count, 'rows normalized')
================================================
FILE: data/parse.py
================================================
"""Parses Amazon product metadata found at http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz"""
import csv, sys, yaml
from yaml import CLoader as Loader
def usage():
print("""
USAGE: python parse.py metadata.json
""")
sys.exit(0)
def main(argv):
if len(argv) < 2:
usage()
filename = sys.argv[1]
with open(filename, 'r') as f:
count, good, bad = 0, 0, 0
out = csv.writer(open("products.csv", "w"))
for line in f:
count += 1
if not (count % 100000):
print("count:", count, "good:", good, ", bad:", bad)
if ("'title':" in line) and ("'brand':" in line) and ("'categories':" in line):
try:
line = line.rstrip().replace("\\'", "''")
product = yaml.load(line, Loader=Loader)
title, brand, categories = product['title'], product['brand'], product['categories']
description = product['description'] if 'description' in product else ''
categories = ' / '.join([item for sublist in categories for item in sublist])
out.writerow([title, brand, description, categories])
good += 1
except Exception as e:
print(line)
print(e)
bad += 1
print("good:", good, ", bad:", bad)
if __name__ == "__main__":
main(sys.argv)
================================================
FILE: data/supplement.py
================================================
"""Supplements product data"""
import csv
import sys
in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['supplemented'] + ['csv'])
with open(in_file, 'r') as f:
reader = csv.reader(f)
writer = csv.writer(open(out_file, "w"))
count, supplemented = 0, 0
for row in reader:
count += 1
if not (count % 10000):
print(supplemented, '/', count, 'rows supplemented')
title, brand, description = row[0], row[1], row[2]
if not (brand in title):
supplemented += 1
title = brand + ' ' + title
description = title + ' ' + description
row[0], row[1], row[2] = title, brand, description
writer.writerow(row)
print(supplemented, '/', count, 'rows supplemented')
================================================
FILE: data/tag.py
================================================
"""Tags product data"""
import csv
import sys
in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['tagged'] + ['csv'])
with open(in_file, 'r') as f:
reader = csv.reader(f)
writer = csv.writer(open(out_file, "w"))
count = 0
for row in reader:
count += 1
if not (count % 10000):
print(count, 'rows tagged')
title, brand, description = row[0], row[1], row[2]
tagging = ''
brand = brand.split(' ')
        brand_index = 0  # position of the next expected brand word; 0 means not inside the brand
        for word in title.split(' '):
            if word == brand[0]:
                tagging += 'B-B '
                brand_index = 1
            elif 0 < brand_index < len(brand) and word == brand[brand_index]:
                tagging += 'I-B '
                brand_index += 1
            else:
                brand_index = 0
                tagging += 'O '
row.append(tagging)
writer.writerow(row)
print(count, 'rows tagged')
================================================
FILE: data/trim.py
================================================
"""Trims product data"""
import csv
import sys
in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['trimmed'] + ['csv'])
with open(in_file, 'r') as f:
reader = csv.reader(f)
writer = csv.writer(open(out_file, "w"))
count, trimmed = 0, 0
for row in reader:
try:
count += 1
if not (count % 10000):
print(trimmed, '/', count, 'rows trimmed')
brand = row[1].lower()
if brand == 'unknown' or brand == '' or brand == 'generic':
trimmed += 1
continue
writer.writerow(row)
        except Exception:
print(row)
print(trimmed, '/', count, 'rows trimmed')
================================================
FILE: extract.py
================================================
"""Script to extract product category specific attributes based on product titles and descriptions
"""
import csv
import os
import sys
from operator import itemgetter
from classifier import ProductClassifier
from ner import ProductNER
from tokenizer import WordTokenizer
def process(row, tokenizer, classifier, ner):
"""Run a row through processing pipeline
tokenize -> classify
-> extract attributes
    Args:
        row (dict(str: str)): Dictionary of field name/field value pairs
        tokenizer (WordTokenizer): Word tokenizer
        classifier (ProductClassifier): Product classifier
        ner (ProductNER): Product named entity recognizer
    Returns:
        dict(str: str): The input row augmented with 'category' and 'brand' fields
"""
# Classify
data = tokenizer.tokenize([row['name'] + ' ' + row['description']])
categories = classifier.classify(data)[0]
row['category'] = max(list(categories.items()), key=itemgetter(1))[0]
# Extract entities
data = tokenizer.tokenize([row['name']])
tags = ner.tag(data)[0]
brand, brand_started = '', False
for word, tag in zip(row['name'].split(' '), tags):
max_tag = max(list(tag.items()), key=itemgetter(1))[0]
if 'B-B' in max_tag and (not brand_started):
brand = word
brand_started = True
elif 'I-B' in max_tag and brand_started:
brand += ' ' + word
else:
brand_started = False
row['brand'] = brand
return row
def usage():
print("""
USAGE: python extract.py model_dir data_file.csv
FORMAT: "id","name","description","price"
""")
sys.exit(0)
def main(argv):
if len(argv) < 3:
usage()
model_dir = sys.argv[1]
data_file = sys.argv[2]
# Load tokenizer
tokenizer = WordTokenizer()
tokenizer.load(os.path.join(model_dir, 'tokenizer'))
# Load classifier
classifier = ProductClassifier()
classifier.load(os.path.join(model_dir, 'classifier'))
# Load named entity recognizer
ner = ProductNER()
ner.load(os.path.join(model_dir, 'ner'))
with open(data_file, 'r', encoding="iso-8859-1") as f:
reader = csv.DictReader(f)
with open('.'.join(data_file.split('.')[:-1] + ['processed', 'csv']), 'w', encoding="utf-8") as outfile:
writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames + ['category', 'brand'])
writer.writeheader()
count = 0
for row in reader:
count += 1
processed_row = process(row, tokenizer, classifier, ner)
print(processed_row)
writer.writerow(processed_row)
if __name__ == "__main__":
main(sys.argv)
================================================
FILE: ner.py
================================================
"""Named entity recognition class"""
import json
import os
import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Embedding, LSTM, Bidirectional, TimeDistributed, Activation, Dropout
from keras.models import load_model, Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from sklearn.metrics import classification_report
class ProductNER(object):
"""Class which recognizes named entities
Attributes:
prefix (str): Model files prefix
model (keras.model): Keras model
tag_map (dict(str, int)): Map between tag names and indices
"""
def __init__(self, prefix=None):
"""Load in model and tag map
Args:
prefix (str): Prefix of directory containing model HDF5 file and tag map JSON file
"""
        if prefix is not None:
self.load(prefix)
else:
self.prefix = 'models/ner'
self.model = None
self.tag_map = {}
def load(self, prefix=None):
"""Load in model and tag map
Args:
prefix (str): Prefix of directory containing model HDF5 file and tag map JSON file
"""
        if prefix is not None: self.prefix = prefix
self.model = load_model(self.prefix+'.h5')
self.tag_map = json.load(open(self.prefix+'.json', 'r'))
def save(self, prefix=None):
"""Save in model and tag map
Args:
prefix (str): Prefix of directory containing model HDF5 file and tag map JSON file
"""
        if prefix is not None: self.prefix = prefix
self.model.save(self.prefix+'.h5')
with open(self.prefix+'.json', 'w') as out:
json.dump(self.tag_map, out)
def tag(self, data):
"""Return all named entities given some embedded text
Args:
data (np.array): 2D array representing descriptions of the product and/or product title
        Returns:
            list(list(dict(str, float))): For each input text, a list of per-word tag probability dictionaries
"""
prediction = self.model.predict(data)
all_tag_probs = []
for i in range(prediction.shape[0]):
sentence_tag_probs = []
first_word = 0
for j in range(data[i].shape[0]):
if data[i,j] != 0: break
first_word += 1
for j in range(first_word, prediction.shape[1]):
word_tag_probs = {}
for tag in self.tag_map:
word_tag_probs[tag] = prediction[i,j,self.tag_map[tag]]
sentence_tag_probs.append(word_tag_probs)
all_tag_probs.append(sentence_tag_probs)
return all_tag_probs
def index_tags(self, tags):
"""Take a list of possibly duplicate tags and create an index list
Args:
tags (list(str)): List of tags
Returns:
list(int): List of indices
"""
indices = []
for tag in tags:
if not (tag in self.tag_map):
self.tag_map[tag] = len(self.tag_map) + 1
indices.append(self.tag_map[tag])
return indices
def get_labels(self, tag_sets):
"""Create labels from a list of tag_sets
Args:
tag_sets (list(list(str))): A list of word tag sets
Returns:
(list(list(int))): List of list of indices
"""
labels = []
print('Getting labels...')
for tag_set in tag_sets:
indexed_tags = self.index_tags(tag_set)
            labels.append(to_categorical(np.asarray(indexed_tags), num_classes=4))
labels = pad_sequences(labels, maxlen=200)
return labels
    def compile(self, tokenizer, glove_dir='./data/', embedding_dim=200, dropout_fraction=0.2, hidden_dim=32):
        """Compile network model for NER
        Args:
            tokenizer (WordTokenizer): Object used to tokenize original texts
            glove_dir (str): Directory containing the GloVe embedding files
            embedding_dim (int): Size of embedding vector
            dropout_fraction (float): Fraction of randomly zeroed weights in dropout layer
            hidden_dim (int): Hidden dimension of the LSTM
        """
# Load embedding layer
print('Loading GloVe embedding...')
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.'+str(embedding_dim)+'d.txt'), 'r')
for line in f:
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
f.close()
print(('Found %s word vectors.' % len(embeddings_index)))
# Create embedding layer
print('Creating embedding layer...')
embedding_matrix = np.zeros((len(tokenizer.tokenizer.word_index) + 1, embedding_dim))
for word, i in list(tokenizer.tokenizer.word_index.items()):
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
# words not found in embedding index will be all-zeros.
embedding_matrix[i] = embedding_vector
# Create network
print('Creating network...')
self.model = Sequential()
self.model.add(Embedding(len(tokenizer.tokenizer.word_index) + 1,
embedding_dim,
weights=[embedding_matrix],
input_length=tokenizer.max_sequence_length,
trainable=False,
mask_zero=True))
        self.model.add(Dropout(dropout_fraction))  # apply dropout after the embedding, as described in the README
        self.model.add(Bidirectional(LSTM(hidden_dim, return_sequences=True)))
self.model.add(TimeDistributed(Dense(len(self.tag_map) + 1)))
self.model.add(Activation('softmax'))
# Compile model
print('Compiling network...')
self.model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['acc'])
def train(self, data, labels, validation_split=0.2, batch_size=256, epochs=2):
"""Train ner
Args:
            data (np.array): 2D numpy array (n_samples, tokenizer.max_sequence_length)
labels (np.array): 3D numpy array (n_samples, tokenizer.max_sequence_length, len(self.tag_map))
validation_split (float): Fraction of samples to be used for validation
batch_size (int): Training batch size
epochs (int): Number of training epochs
"""
print('Training...')
# Split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(validation_split * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
print(data.shape, labels.shape)
# Train!
self.save()
checkpointer = ModelCheckpoint(filepath=self.prefix+'.h5', verbose=1, save_best_only=False)
        self.model.fit(x_train, y_train, validation_data=(x_val, y_val),
                       callbacks=[checkpointer],
                       epochs=epochs, batch_size=batch_size)
self.evaluate(x_val, y_val, batch_size)
def evaluate(self, x_test, y_test, batch_size=256):
"""Evaluate classifier
Args:
x_test (np.array): 2D numpy array (n_samples, tokenizer.max_sequence_length)
y_test (np.array): 3D numpy array (n_samples, tokenizer.max_sequence_length, len(self.tag_map))
batch_size (int): Training batch size
"""
print('Evaluating...')
predictions_last_epoch = self.model.predict(x_test, batch_size=batch_size, verbose=1)
predicted_classes = np.argmax(predictions_last_epoch, axis=2).flatten()
y_val = np.argmax(y_test, axis=2).flatten()
target_names = ['']*(max(self.tag_map.values())+1)
for category in self.tag_map:
target_names[self.tag_map[category]] = category
        print((classification_report(y_val, predicted_classes, target_names=target_names, digits=6, labels=range(len(target_names)))))
================================================
FILE: tokenizer.py
================================================
"""Word tokenizer class"""
import os
import numpy as np
try:
import cPickle as pickle
except:
import pickle
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
class WordTokenizer(object):
"""Class which tokenizes words
Attributes:
max_sequence_length (int): Maximum sequence length for embedding
tokenizer (Tokenizer): Keras Tokenizer
prefix (str): Prefix for tokenizer save file
"""
def __init__(self, max_sequence_length=200, prefix="./models/tokenizer"):
"""Create tokenizer
Args:
max_sequence_length (int): Maximum sequence length for texts
prefix (str): Prefix for tokenizer save file
"""
self.max_sequence_length = max_sequence_length
self.prefix = prefix
self.tokenizer = None
def save(self, prefix=None):
"""Saves the tokenizer
Args:
prefix (str): Prefix for tokenizer save file
"""
        if prefix is not None: self.prefix = prefix
pickle.dump(self.tokenizer, open(self.prefix + ".pickle", "wb"))
    def load(self, prefix=None):
        """Loads the tokenizer
        Args:
            prefix (str): Prefix for tokenizer save file
        """
        if prefix is not None: self.prefix = prefix
self.tokenizer = pickle.load(open(self.prefix + ".pickle", "rb"))
def train(self, texts, max_nb_words=80000):
"""Takes a list of texts, fits a tokenizer to them, and creates the embedding matrix.
Args:
texts (list(str)): List of texts
max_nb_words: Maximum number of words indexed (take most frequently used)
"""
# Tokenize
print('Training tokenizer...')
        self.tokenizer = Tokenizer(num_words=max_nb_words)
self.tokenizer.fit_on_texts(texts)
self.save()
print(('Found %s unique tokens.' % len(self.tokenizer.word_index)))
def tokenize(self, texts):
"""Takes a list of texts and tokenizes them.
Args:
texts (list(str)): List of texts
Returns:
np.array: 2D numpy array (len(texts), self.max_sequence_length)
"""
sequences = self.tokenizer.texts_to_sequences(texts)
data = pad_sequences(sequences, maxlen=self.max_sequence_length)
return data
================================================
FILE: train_classifier.py
================================================
"""Script to train a product category classifier based on product titles and descriptions
"""
import csv
import sys
from classifier import ProductClassifier
from tokenizer import WordTokenizer
MAX_TEXTS = 1000000
def usage():
print("""
USAGE: python train_classifier.py data_file.csv
FORMAT: "title","brand","description","categories"
""")
sys.exit(0)
def main(argv):
if len(argv) < 2:
usage()
# Fetch data
texts, categories = [], []
with open(sys.argv[1], 'r') as f:
reader = csv.DictReader(f, fieldnames=["title", "brand", "description", "categories"])
count = 0
for row in reader:
count += 1
text, category = row['description'], row['categories'].split(' / ')[0]
texts.append(text)
categories.append(category)
if count >= MAX_TEXTS:
break
print(('Processed %s texts.' % len(texts)))
# Tokenize texts
tokenizer = WordTokenizer()
tokenizer.load()
data = tokenizer.tokenize(texts)
# Get labels from classifier
classifier = ProductClassifier()
labels = classifier.get_labels(categories)
# Compile classifier network and train
classifier.compile(tokenizer)
classifier.train(data, labels, epochs=2)
if __name__ == "__main__":
main(sys.argv)
================================================
FILE: train_ner.py
================================================
"""Script to train a product category ner based on product titles and descriptions
"""
import csv
import sys
from ner import ProductNER
from tokenizer import WordTokenizer
MAX_TEXTS = 1000000
def usage():
print("""
USAGE: python train_ner.py data_file.csv
FORMAT: "title","brand","description","categories"
""")
sys.exit(0)
def main(argv):
if len(argv) < 2:
usage()
# Fetch data
texts, tags = [], []
with open(sys.argv[1], 'r') as f:
reader = csv.DictReader(f, fieldnames=["title", "brand", "description", "categories", "tags"])
count = 0
for row in reader:
count += 1
text, tag_set = row['title'], row['tags'].split(' ')[:-1]
texts.append(text)
tags.append(tag_set)
if count >= MAX_TEXTS:
break
print(('Processed %s texts.' % len(texts)))
# Tokenize texts
tokenizer = WordTokenizer()
tokenizer.load()
data = tokenizer.tokenize(texts)
# Get labels from NER
ner = ProductNER()
labels = ner.get_labels(tags)
# Compile NER network and train
ner.compile(tokenizer)
ner.train(data, labels, epochs=2)
if __name__ == "__main__":
main(sys.argv)
================================================
FILE: train_tokenizer.py
================================================
"""Script to train a word tokenizer
"""
import csv
import sys
from tokenizer import WordTokenizer
MAX_TEXTS = 1000000
def usage():
print("""
USAGE: python train_tokenizer.py data_file.csv
FORMAT: "title","brand","description","categories"
""")
sys.exit(0)
def main(argv):
if len(argv) < 2:
usage()
# Fetch data
texts, categories = [], []
with open(sys.argv[1], 'r') as f:
reader = csv.DictReader(f, fieldnames=["title", "brand", "description", "categories"])
count = 0
for row in reader:
count += 1
text, category = row['title'] + ' ' + row['description'], row['categories'].split(' / ')[0]
texts.append(text)
categories.append(category)
if count >= MAX_TEXTS:
break
print(('Processed %s texts.' % len(texts)))
# Tokenize texts
tokenizer = WordTokenizer()
tokenizer.train(texts)
if __name__ == "__main__":
main(sys.argv)