Repository: etano/productner
Branch: master
Commit: 8c511964be69
Files: 16
Total size: 44.1 KB

Directory structure:
gitextract_hvlacfzd/
├── .gitignore
├── Pipfile
├── README.md
├── classifier.py
├── data/
│   ├── groups.py
│   ├── normalize.py
│   ├── parse.py
│   ├── supplement.py
│   ├── tag.py
│   └── trim.py
├── extract.py
├── ner.py
├── tokenizer.py
├── train_classifier.py
├── train_ner.py
└── train_tokenizer.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.swp
*.pyc
*.swo
*.swn
*.txt
*.csv
*.json
*.h5
.idea/*
*.zip
*.gz
.DS_Store
/models

================================================
FILE: Pipfile
================================================
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"

[dev-packages]

[packages]
keras = "*"
sklearn = "*"
tensorflow = "*"
"h5py" = "*"

================================================
FILE: README.md
================================================
# Product categorization and named entity recognition

This repository is meant to automatically extract features from product titles and descriptions. Below we explain how to install and run the code, and describe the implemented algorithms. We also provide background information, including the current state of the art in both sequence classification and sequence tagging, and suggest possible improvements to the current implementation. Enjoy!

## Requirements

Use Python 3.7 and install the dependencies with the following command (please use venv or conda):

```
pip install -r requirements.txt
```

## Usage

### Fetching data

#### Amazon product data

    cd ./data/
    wget http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz
    gzip -d metadata.json.gz

#### GloVe

    cd ./data/
    wget https://nlp.stanford.edu/data/glove.6B.zip
    unzip glove.6B.zip

### Preprocessing data

    cd ./data/
    python parse.py metadata.json
    python normalize.py products.csv
    python trim.py products.normalized.csv
    python supplement.py products.normalized.trimmed.csv
    python tag.py products.normalized.trimmed.supplemented.csv

### Training models

    mkdir -p ./models/
    python train_tokenizer.py data/products.normalized.trimmed.supplemented.tagged.csv
    python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv
    python train_ner.py data/products.normalized.trimmed.supplemented.tagged.csv

### Extract information

Run inference on our sample dataset with your trained models:

    python extract.py ./models/ Product\ Dataset.csv

## Contents

- extract.py: Script to extract product category specific attributes based on product titles and descriptions
- train_tokenizer.py: Script to train a word tokenizer
- train_ner.py: Script to train a product named entity recognizer based on product titles
- train_classifier.py: Script to train a product category classifier based on product titles and descriptions
- tokenizer.py: Word tokenizer class
- ner.py: Named entity recognition class
- classifier.py: Product classifier class
- data/parse.py: Parses Amazon product metadata found at http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz
- data/normalize.py: Normalizes product data
- data/trim.py: Trims product data
- data/supplement.py: Supplements product data
- data/tag.py: Tags product data
- Product\ Dataset.csv: CSV file with product ids, names, and descriptions

## Algorithms

These are the methods used in this demonstrative
implementation. For state-of-the-art extensions, we refer the reader to the references listed below.

- Tokenization: Built-in Keras tokenizer with 80,000 word maximum
- Embedding: Stanford GloVe (Wikipedia 2014 + Gigaword 5, 200 dimensions) with 200 sequence length maximum
- Sequence classification: 3-layer CNN with max pooling between the layers
- Sequence tagging: Bidirectional LSTM

For the sequence classification task, we extract product titles, descriptions, and categories from the Amazon product corpus. We then fit our CNN model to predict the product category from a combination of product title and description. On 800K samples with a batch size of 256, we achieve an overall F1 score of ~0.90 after 2 epochs.

For the sequence tagging task, we extract product titles and brands from the Amazon product corpus. We then fit our bidirectional LSTM model to label each word token in the product title as either a brand or not. On 800K samples with a batch size of 256, we achieve an overall F1 score of ~0.85 after 2 epochs.

For both models we use the GloVe embedding with 200 dimensions, though we note that a higher-dimensional embedding might achieve superior performance. Additionally, we could be more careful in the data preprocessing to trim bad tokens (e.g. HTML remnants). Also for both models, we use a dropout layer after the embedding to combat overfitting.

## Background

### Problem definition

The problem of extracting features from unstructured textual data can be given different names depending on the circumstances and desired outcome. Generally, we can split tasks into two camps: sequence classification and sequence tagging.

In sequence classification, we take a text fragment (anything from a sentence up to an entire document) and try to project it into a categorical space. This is considered a many-to-one classification in that we are taking a set of many features and producing a single output.
Sequence tagging, on the other hand, is often considered a many-to-many problem since you take in an entire sequence and attempt to apply a label to each element of the sequence. An example of sequence tagging is part-of-speech labeling, where one attempts to label the part of speech of each word in a sentence. Other methods that fall into this camp include chunking (breaking a sentence into relational components) and named entity recognition (extracting pre-specified features like geographic locations or proper names).

### Tokenization and embedding

An important step in almost any natural language processing task is projecting from the character-based space that composes words and sentences to a numeric space on which computer models can operate.

The first step is simply to index the unique tokens appearing in a dataset. There is some freedom in what is considered a token: it can be a specific group of words, a single word, or even an individual character. A popular choice is to simply create a word-based dictionary which maps unique space-separated character sequences to unique indices. Usually this is done after a normalization procedure where everything is lower-cased, converted to ASCII, etc. This dictionary can then be sorted by frequency of occurrence in the dataset and truncated to a maximum size. After tokenization, your dataset is transformed into a set of indices where truncated words are typically replaced with a '0' index.

Following tokenization, the indexed words are often projected into an embedding vector space. Currently popular embeddings include word2vec [\[1\]](#references) and GloVe [\[2\]](#references). Word2vec (as the name implies) is a word-to-vector-space projector composed of a two-layer neural network.
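
Before going further into embeddings, the word-index construction described above can be sketched in a few lines. This is a standalone illustration that follows the description in this section (frequency-sorted dictionary, truncation to a maximum size, `0` for dropped words); it is not the Keras tokenizer the repository uses, and the names `build_word_index` and `encode` are hypothetical.

```python
from collections import Counter


def build_word_index(texts, max_words):
    """Map the most frequent words to indices 1..max_words (0 is reserved)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ranked[:max_words])}


def encode(text, word_index):
    """Replace each word with its index; truncated or unseen words become 0."""
    return [word_index.get(word, 0) for word in text.lower().split()]


texts = ["Apple iPhone case", "apple watch band", "Samsung phone case"]
word_index = build_word_index(texts, max_words=3)
encoded = encode("Apple phone case", word_index)
print(word_index, encoded)
```

With a vocabulary capped at three words, only the most frequent words keep an index of their own, and every other word in the encoded title collapses to `0`.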
The network is trained in one of two ways: continuous bag-of-words, where the model attempts to predict the current word by using the surrounding words as context features, and continuous skip-grams, where the model attempts to predict surrounding context words by looking at the current word. GloVe stands for "Global Vectors for Word Representation". Essentially, it is a count-based embedding learned without supervision: a token co-occurrence matrix is constructed and factorized.

Vector-space embedding methods generally provide substantial improvements over using basic dictionaries since they inject contextual knowledge from the language. Additionally, they allow a much more compact representation while maintaining important correlations. For example, they allow you to do amazing things like performing word arithmetic:

    king - man + woman = queen

where equality is judged approximately, by directly computing vector overlaps.

### Sequence classification

Classification is the shining pillar of modern-day machine learning, with convolutional neural networks (CNNs) at the top. With their ability to efficiently represent high-level features via windowed filtering, CNNs have seen their largest success in the classification and segmentation of images. More recently, however, CNNs have started seeing success in natural language sequence classification as well. Several recent works have shown that for text classification, CNNs can significantly outperform other classification methods such as hidden Markov models and support vector machines [\[3,4\]](#references). CNNs likely succeed in text classification for the same reason they succeed in the vision domain: there are strong, regular correlations between nearby features which are efficiently picked up by reasonably sized filters.

Even more recently, the dominance of CNNs has been toppled by recurrent neural network (RNN) architectures. In particular, long short-term memory (LSTM) units have shown exceptional promise.
LSTMs pass output from one unit to the next while carrying along an internal state. How this state updates (as well as the other weights in the network) can be trained end-to-end on variable-length sequences by passing a single token at a time. For classification, bidirectional LSTMs, which allow for long-range contextual correlations in both forward and reverse directions, have seen the best performance [\[5,6\]](#references). An additional feature of these networks is an attention layer that allows continuous addressing of the internal states of the sequential LSTM units. This further strengthens the network's ability to draw correlations from both nearby and far-away tokens.

### Sequence tagging

As mentioned above, sequence tagging is a many-to-many machine learning task and thus places added emphasis on the sequential nature of the input and output. This makes CNNs largely ill-suited for the problem. Instead, the dominant approaches are again bidirectional LSTMs [\[11,12\]](#references) as well as another method called conditional random fields (CRFs) [\[7\]](#references). CRFs can be seen as either sequential logistic regression or more powerful hidden Markov models. Essentially, they are sequential models composed of many defined feature functions that depend both on the word currently being labelled and on the surrounding words. The relative weights of these feature functions can then be trained via any supervised learning approach. CRFs are used extensively in the literature for both part-of-speech tagging and named entity recognition because of their ease of use and interpretability [\[8-10\]](#references).

Even more recent models for sequence tagging use a combination of the aforementioned methods (CNN, LSTM, and CRF) [\[13,14,15\]](#references). These works usually use a bidirectional LSTM as the major labeling architecture, another RNN or CNN to capture character-level information, and finally a CRF layer to model the label dependency.
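
To ground the tagging task itself, here is a minimal sketch of the label sequence such a model is trained to predict for brand extraction. It uses the `B-B` / `I-B` / `O` labels that appear in `data/tag.py`, but it is a simplified variant that only matches the brand as a whole phrase, not a copy of that script; the function name `tag_brand` is hypothetical.

```python
def tag_brand(title, brand):
    """Label each title token as B-B (brand start), I-B (brand continuation), or O."""
    tokens = title.split(' ')
    brand_tokens = brand.split(' ')
    tags = ['O'] * len(tokens)
    n = len(brand_tokens)
    i = 0
    while i < len(tokens):
        # Only tag when the full brand phrase appears starting at position i.
        if tokens[i:i + n] == brand_tokens:
            tags[i] = 'B-B'
            for j in range(i + 1, i + n):
                tags[j] = 'I-B'
            i += n
        else:
            i += 1
    return tags


tags = tag_brand("kate spade new york leather wallet", "kate spade")
print(tags)
```

Each title token receives exactly one label, which is what makes the problem many-to-many: the output sequence has the same length as the input sequence.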
A logical next step will be to combine these methods with the neural attention models used in sequence classification, though this currently seems to be missing from the literature.

### Future directions

Looking forward, there are several available avenues for continued research. More sophisticated word embeddings might help alleviate the need for complicated neural architectures. Hierarchical optimization methods can be used to automatically build new architectures as well as optimize hyperparameters. Diverse models can be intelligently combined to produce more powerful classification schemes (indeed, almost all Kaggle competitions are won this way). One interesting approach is to combine text data with other available data sources such as associated images [\[10\]](#references). By collecting data from different sources, feature labels could possibly be extracted automatically by cross-comparison.

## References

[1] "Distributed Representations of Words and Phrases and their Compositionality". Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. https://arxiv.org/abs/1310.4546.

[2] "GloVe: Global Vectors for Word Representation". Stanford NLP. 2015. https://nlp.stanford.edu/projects/glove/.

[3] "Convolutional Neural Networks for Sentence Classification". Yoon Kim. 2014. https://arxiv.org/abs/1408.5882.

[4] "Character-level Convolutional Networks for Text Classification". Xiang Zhang, Junbo Zhao, Yann LeCun. 2015. https://arxiv.org/abs/1509.01626.

[5] "Document Modeling with Gated Recurrent Neural Network for Sentiment Classification". Duyu Tang, Bing Qin, Ting Liu. 2015. http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP167.pdf.

[6] "Hierarchical Attention Networks for Document Classification". Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy. 2016. https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf.

[7] "An Introduction to Conditional Random Fields". Charles Sutton, Andrew McCallum. 2010. https://arxiv.org/abs/1011.4088.

[8] "Attribute Extraction from Product Titles in eCommerce". Ajinkya More. 2016. https://arxiv.org/abs/1608.04670.

[9] "Bootstrapped Named Entity Recognition for Product Attribute Extraction". Duangmanee (Pew) Putthividhya, Junling Hu. 2011. http://www.aclweb.org/anthology/D11-1144.

[10] "A Machine Learning Approach for Product Matching and Categorization". Petar Ristoski, Petar Petrovski, Peter Mika, Heiko Paulheim. 2017. http://www.semantic-web-journal.net/content/machine-learning-approach-product-matching-and-categorization-0.

[11] "Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss". Barbara Plank, Anders Søgaard, Yoav Goldberg. 2016. https://arxiv.org/abs/1604.05529.

[12] "Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network". Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao. 2015. https://arxiv.org/abs/1510.06168.

[13] "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF". Xuezhe Ma, Eduard Hovy. 2016. https://arxiv.org/abs/1603.01354.

[14] "Neural Architectures for Named Entity Recognition". Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. 2016. https://arxiv.org/abs/1603.01360.

[15] "Neural Models for Sequence Chunking". Feifei Zhai, Saloni Potdar, Bing Xiang, Bowen Zhou. 2017. https://arxiv.org/abs/1701.04027.
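
As a closing illustration of the word-arithmetic example from the "Tokenization and embedding" section, the sketch below computes `king - man + woman` and finds the closest word by cosine similarity, one common way to measure vector overlap. The two-dimensional vectors are hand-made toy values chosen for illustration, not real GloVe or word2vec embeddings, and the helper name `cosine` is an assumption.

```python
import math

# Hand-made 2-D toy vectors (axes: "royalty", "maleness"); NOT real GloVe values.
vectors = {
    "king": (0.9, 0.8),
    "queen": (0.9, 0.1),
    "man": (0.1, 0.9),
    "woman": (0.1, 0.1),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# king - man + woman, computed component-wise.
target = tuple(
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
)

# The word whose vector overlaps most with the target vector.
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(nearest)
```

With real embeddings the input words are usually excluded from the candidate set, since the result vector often stays close to one of them; the toy vectors here are simple enough that this is not needed.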

================================================
FILE: classifier.py
================================================
"""Product classifier class"""

import json
import os

import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Input, Flatten, Dropout, Conv1D, MaxPooling1D, Embedding
from keras.models import load_model, Model
from keras.utils import to_categorical
from sklearn.metrics import classification_report


class ProductClassifier(object):
    """Class which classifies products based on various inputs

    Attributes:
        prefix (str): Model files prefix
        model (keras.model): Keras model
        category_map (dict(str, int)): Map between category names and indices
    """

    def __init__(self, prefix=None):
        """Load in model and category map

        Args:
            prefix (str): Prefix of directory containing model HDF5 file and category map JSON file
        """
        if prefix is not None:
            self.load(prefix)
        else:
            self.prefix = 'models/classifier'
            self.model = None
            self.category_map = {}

    def load(self, prefix=None):
        """Load in model and category map

        Args:
            prefix (str): Prefix of directory containing model HDF5 file and category map JSON file
        """
        if prefix is not None:
            self.prefix = prefix
        self.model = load_model(self.prefix + '.h5')
        # Use a context manager so the JSON file handle is closed after reading
        with open(self.prefix + '.json', 'r') as f:
            self.category_map = json.load(f)

    def save(self, prefix=None):
        """Save model and category map

        Args:
            prefix (str): Prefix of directory containing model HDF5 file and category map JSON file
        """
        if prefix is not None:
            self.prefix = prefix
        self.model.save(self.prefix + '.h5')
        with open(self.prefix + '.json', 'w') as out:
            json.dump(self.category_map, out)

    def index_categories(self, categories):
        """Take a list of possibly duplicate categories and create an index list

        Args:
            categories (list(str)): List of categories

        Returns:
            list(int): List of indices
        """
        print('Indexing categories...')
        indices = []
        for category in categories:
            if category not in self.category_map:
                self.category_map[category] = len(self.category_map)
            indices.append(self.category_map[category])
        print(('Found %s unique categories.' % len(self.category_map)))
        return indices

    def classify(self, data):
        """Classify products by text

        Args:
            data (np.array): 2D array representing descriptions of the product and/or product title

        Returns:
            list(dict(str, float)): List of dictionaries of product categories with associated confidence
        """
        prediction = self.model.predict(data)
        all_category_probs = []
        for i in range(prediction.shape[0]):
            category_probs = {}
            for category in self.category_map:
                category_probs[category] = prediction[i, self.category_map[category]]
            all_category_probs.append(category_probs)
        return all_category_probs

    def get_labels(self, categories):
        """Create labels from a list of categories

        Args:
            categories (list(str)): A list of product categories

        Returns:
            np.array: One-hot label matrix (n_samples, n_categories)
        """
        indexed_categories = self.index_categories(categories)
        labels = to_categorical(np.asarray(indexed_categories))
        return labels

    def compile(self, tokenizer, glove_dir='./data/', embedding_dim=100, dropout_fraction=0.0, kernal_size=5, n_filters=128):
        """Compile network model for classifier

        Args:
            tokenizer (WordTokenizer): Object used to tokenize original texts
            glove_dir (str): Directory containing the GloVe files
            embedding_dim (int): Size of embedding vector
            dropout_fraction (float): Fraction of randomly zeroed weights in dropout layer
            kernal_size (int): Size of sliding window for convolution
            n_filters (int): Number of filters to produce from convolution
        """
        # Load embedding layer
        print('Loading GloVe embedding...')
        embeddings_index = {}
        glove_path = os.path.join(glove_dir, 'glove.6B.' + str(embedding_dim) + 'd.txt')
        with open(glove_path, 'r') as f:
            for line in f:
                # Each line holds a word followed by its embedding coefficients
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs
        print(('Found %s word vectors.'
               % len(embeddings_index)))

        # Create embedding layer
        print('Creating embedding layer...')
        embedding_matrix = np.zeros((len(tokenizer.tokenizer.word_index) + 1, embedding_dim))
        for word, i in list(tokenizer.tokenizer.word_index.items()):
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                # words not found in embedding index will be all-zeros.
                embedding_matrix[i] = embedding_vector
        embedding_layer = Embedding(len(tokenizer.tokenizer.word_index) + 1,
                                    embedding_dim,
                                    weights=[embedding_matrix],
                                    input_length=tokenizer.max_sequence_length,
                                    trainable=False)

        # Create network
        print('Creating network...')
        sequence_input = Input(shape=(tokenizer.max_sequence_length,), dtype='int32')
        embedded_sequences = embedding_layer(sequence_input)
        x = Dropout(dropout_fraction)(embedded_sequences)
        x = Conv1D(n_filters, kernal_size, activation='relu')(x)
        x = MaxPooling1D(kernal_size)(x)
        x = Conv1D(n_filters, kernal_size, activation='relu')(x)
        x = MaxPooling1D(kernal_size)(x)
        x = Conv1D(n_filters, kernal_size, activation='relu')(x)
        x = MaxPooling1D(int(x.shape[1]))(x)  # global max pooling
        x = Flatten()(x)
        x = Dense(n_filters, activation='relu')(x)
        preds = Dense(len(self.category_map), activation='softmax')(x)

        # Compile model
        print('Compiling network...')
        self.model = Model(sequence_input, preds)
        self.model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop',
                           metrics=['acc'])

    def train(self, data, labels, validation_split=0.2, batch_size=256, epochs=2):
        """Train classifier

        Args:
            data (np.array): 2D numpy array of token indices (n_samples, tokenizer.max_sequence_length)
            labels (np.array): 2D numpy array (n_samples, len(self.category_map))
            validation_split (float): Fraction of samples to be used for validation
            batch_size (int): Training batch size
            epochs (int): Number of training epochs
        """
        print('Training...')
        # Split the data into a training set and a validation set
        indices = np.arange(data.shape[0])
        np.random.shuffle(indices)
        data = data[indices]
        labels = labels[indices]
        nb_validation_samples = int(validation_split * data.shape[0])
        x_train = data[:-nb_validation_samples]
        y_train = labels[:-nb_validation_samples]
        x_val = data[-nb_validation_samples:]
        y_val = labels[-nb_validation_samples:]

        # Train!
        self.save()
        checkpointer = ModelCheckpoint(filepath=self.prefix + '.h5', verbose=1, save_best_only=False)
        # Keras 2 names this argument `epochs` (it was `nb_epoch` in Keras 1)
        self.model.fit(x_train, y_train, validation_data=(x_val, y_val),
                       callbacks=[checkpointer],
                       epochs=epochs, batch_size=batch_size)
        self.evaluate(x_val, y_val, batch_size)

    def evaluate(self, x_test, y_test, batch_size=256):
        """Evaluate classifier

        Args:
            x_test (np.array): 2D numpy array of token indices (n_samples, tokenizer.max_sequence_length)
            y_test (np.array): 2D numpy array (n_samples, len(self.category_map))
            batch_size (int): Evaluation batch size
        """
        print('Evaluating...')
        predictions_last_epoch = self.model.predict(x_test, batch_size=batch_size, verbose=1)
        predicted_classes = np.argmax(predictions_last_epoch, axis=1)
        target_names = [''] * len(self.category_map)
        for category in self.category_map:
            target_names[self.category_map[category]] = category
        y_val = np.argmax(y_test, axis=1)
        print((classification_report(y_val, predicted_classes, target_names=target_names, digits=6, labels=range(len(self.category_map)))))

================================================
FILE: data/groups.py
================================================
import csv
import sys
from operator import itemgetter

# Count how often each brand and each top-level category occurs
with open(sys.argv[1], 'r') as f:
    reader = csv.reader(f)
    brands, categories = {}, {}
    count = 0
    for row in reader:
        count += 1
        if not (count % 10000):
            print(count)
        brand = row[1]
        if brand in brands:
            brands[brand] += 1
        else:
            brands[brand] = 1
        category = row[3].split(' / ')[0]
        if category in categories:
            categories[category] += 1
        else:
            categories[category] = 1

print(sorted(brands.items(), key=itemgetter(1)))
print(sorted(categories.items(), key=itemgetter(1)))

================================================
FILE: data/normalize.py
================================================
"""Normalizes product data"""

import csv
import sys


def unescape(s):
    """Unescape HTML entities, on both Python 3 and Python 2"""
    if sys.version_info >= (3, 0):
        import html
        output = html.unescape(str(s))
    else:
        import htmllib
        p = htmllib.HTMLParser(None)
        p.save_bgn()
        try:
            p.feed(s)
        except Exception:
            return s
        output = p.save_end()
    return output


in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['normalized'] + ['csv'])

# Open both files in one `with` so the output is flushed and closed on exit
with open(in_file, 'r') as f, open(out_file, 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    count = 0
    for row in reader:
        count += 1
        if not (count % 10000):
            print(count, 'rows normalized')
        row = [unescape(x).lower().replace('\\n', ' ') for x in row]
        writer.writerow(row)
    print(count, 'rows normalized')

================================================
FILE: data/parse.py
================================================
"""Parses Amazon product metadata found at http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz"""

import csv, sys, yaml
from yaml import CLoader as Loader


def usage():
    print("""
USAGE: python parse.py metadata.json
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 2:
        usage()
    filename = argv[1]

    with open(filename, 'r') as f, open("products.csv", "w") as out_file:
        count, good, bad = 0, 0, 0
        out = csv.writer(out_file)
        for line in f:
            count += 1
            if not (count % 100000):
                print("count:", count, "good:", good, ", bad:", bad)
            # Only keep records that carry a title, a brand, and categories
            if ("'title':" in line) and ("'brand':" in line) and ("'categories':" in line):
                try:
                    line = line.rstrip().replace("\\'", "''")
                    product = yaml.load(line, Loader=Loader)
                    title, brand, categories = product['title'], product['brand'], product['categories']
                    description = product['description'] if 'description' in product else ''
                    # Flatten the nested category lists into one ' / '-separated string
                    categories = ' / '.join([item for sublist in categories for item in sublist])
                    out.writerow([title, brand, description, categories])
                    good += 1
                except Exception as e:
                    print(line)
                    print(e)
                    bad += 1
        print("good:", good, ", bad:", bad)


if __name__ == "__main__":
    main(sys.argv)

================================================
FILE: data/supplement.py
================================================
"""Supplements product data"""

import csv
import sys

in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['supplemented'] + ['csv'])

with open(in_file, 'r') as f, open(out_file, 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    count, supplemented = 0, 0
    for row in reader:
        count += 1
        if not (count % 10000):
            print(supplemented, '/', count, 'rows supplemented')
        title, brand, description = row[0], row[1], row[2]
        # Prepend the brand when the title does not already mention it
        if brand not in title:
            supplemented += 1
            title = brand + ' ' + title
        description = title + ' ' + description
        row[0], row[1], row[2] = title, brand, description
        writer.writerow(row)
    print(supplemented, '/', count, 'rows supplemented')

================================================
FILE: data/tag.py
================================================
"""Tags product data"""

import csv
import sys

in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['tagged'] + ['csv'])

with open(in_file, 'r') as f, open(out_file, 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    count = 0
    for row in reader:
        count += 1
        if not (count % 10000):
            print(count, 'rows tagged')
        title, brand, description = row[0], row[1], row[2]
        tagging = ''
        brand = brand.split(' ')
        brand_started = False
        for word in title.split(' '):
            if word == brand[0]:
                tagging += 'B-B '
                brand_started = True
            elif brand_started and word in brand[1:]:
                # Emit exactly one tag per word, even for brands of three or more words
                tagging += 'I-B '
            else:
                brand_started = False
                tagging += 'O '
        row.append(tagging)
        writer.writerow(row)
    print(count, 'rows tagged')

================================================
FILE: data/trim.py
================================================
"""Trims product data"""

import csv
import sys

in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['trimmed'] + ['csv'])

with open(in_file, 'r') as f, open(out_file, 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    count, trimmed = 0, 0
    for row in reader:
        try:
            count += 1
            if not (count % 10000):
                print(trimmed, '/', count, 'rows trimmed')
            # Drop rows whose brand carries no information
            brand = row[1].lower()
            if brand == 'unknown' or brand == '' or brand == 'generic':
                trimmed += 1
                continue
            writer.writerow(row)
        except Exception:
            print(row)
    print(trimmed, '/', count, 'rows trimmed')

================================================
FILE: extract.py
================================================
"""Script to extract product category specific attributes based on product titles and descriptions
"""

import csv
import os
import sys
from operator import itemgetter

from classifier import ProductClassifier
from ner import ProductNER
from tokenizer import WordTokenizer


def process(row, tokenizer, classifier, ner):
    """Run a row through processing pipeline tokenize -> classify -> extract attributes

    Args:
        row (dict(str: str)): Dictionary of field name/field value pairs
        tokenizer (WordTokenizer): Word tokenizer
        classifier (ProductClassifier): Product classifier
        ner (ProductNER): Product named entity recognizer

    Returns:
        dict(str: str): The input row with 'category' and 'brand' fields added
    """
    # Classify
    data = tokenizer.tokenize([row['name'] + ' ' + row['description']])
    categories = classifier.classify(data)[0]
    row['category'] = max(list(categories.items()), key=itemgetter(1))[0]

    # Extract entities
    data = tokenizer.tokenize([row['name']])
    tags = ner.tag(data)[0]
    brand, brand_started = '', False
    for word, tag in zip(row['name'].split(' '), tags):
        # Pick the most probable tag for this word
        max_tag = max(list(tag.items()), key=itemgetter(1))[0]
        if 'B-B' in max_tag and (not brand_started):
            brand = word
            brand_started = True
        elif 'I-B' in max_tag and brand_started:
            brand += ' ' + word
        else:
            brand_started = False
    row['brand'] = brand

    return row


def usage():
    print("""
USAGE: python extract.py model_dir data_file.csv
FORMAT: "id","name","description","price"
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 3:
        usage()
    model_dir = sys.argv[1]
    data_file = sys.argv[2]

    # Load tokenizer
    tokenizer = WordTokenizer()
    tokenizer.load(os.path.join(model_dir, 'tokenizer'))

    # Load classifier
    classifier = ProductClassifier()
    classifier.load(os.path.join(model_dir, 'classifier'))

    # Load named entity recognizer
    ner = ProductNER()
    ner.load(os.path.join(model_dir, 'ner'))

    with open(data_file, 'r', encoding="iso-8859-1") as f:
        reader = csv.DictReader(f)
        with open('.'.join(data_file.split('.')[:-1] + ['processed', 'csv']), 'w', encoding="utf-8") as outfile:
            writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames + ['category', 'brand'])
            writer.writeheader()
            count = 0
            for row in reader:
                count += 1
                processed_row = process(row, tokenizer, classifier, ner)
                print(processed_row)
                writer.writerow(processed_row)


if __name__ == "__main__":
    main(sys.argv)

================================================
FILE: ner.py
================================================
"""Named entity recognition class"""

import json
import os

import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Embedding, LSTM, Bidirectional, TimeDistributed, Activation
from keras.models import load_model, Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.metrics import classification_report


class ProductNER(object):
    """Class which recognizes named entities

    Attributes:
        prefix (str): Model files prefix
        model (keras.model): Keras model
        tag_map (dict(str, int)): Map between tag names and indices
    """

    def __init__(self, prefix=None):
        """Load in model and tag map

        Args:
            prefix (str): Prefix of directory containing model HDF5 file and tag map JSON file
        """
        if prefix is not None:
            self.load(prefix)
        else:
            self.prefix = 'models/ner'
            self.model = None
            self.tag_map = {}

    def load(self, prefix=None):
        """Load in model and tag map

        Args:
            prefix (str): Prefix of directory containing model HDF5 file and tag map JSON file
        """
        if prefix is not None:
            self.prefix = prefix
self.model = load_model(self.prefix+'.h5') self.tag_map = json.load(open(self.prefix+'.json', 'r')) def save(self, prefix=None): """Save in model and tag map Args: prefix (str): Prefix of directory containing model HDF5 file and tag map JSON file """ if prefix != None: self.prefix = prefix self.model.save(self.prefix+'.h5') with open(self.prefix+'.json', 'w') as out: json.dump(self.tag_map, out) def tag(self, data): """Return all named entities given some embedded text Args: data (np.array): 2D array representing descriptions of the product and/or product title Returns: list(list(dict(str, float))): List of lists of entities """ prediction = self.model.predict(data) all_tag_probs = [] for i in range(prediction.shape[0]): sentence_tag_probs = [] first_word = 0 for j in range(data[i].shape[0]): if data[i,j] != 0: break first_word += 1 for j in range(first_word, prediction.shape[1]): word_tag_probs = {} for tag in self.tag_map: word_tag_probs[tag] = prediction[i,j,self.tag_map[tag]] sentence_tag_probs.append(word_tag_probs) all_tag_probs.append(sentence_tag_probs) return all_tag_probs def index_tags(self, tags): """Take a list of possibly duplicate tags and create an index list Args: tags (list(str)): List of tags Returns: list(int): List of indices """ indices = [] for tag in tags: if not (tag in self.tag_map): self.tag_map[tag] = len(self.tag_map) + 1 indices.append(self.tag_map[tag]) return indices def get_labels(self, tag_sets): """Create labels from a list of tag_sets Args: tag_sets (list(list(str))): A list of word tag sets Returns: (list(list(int))): List of list of indices """ labels = [] print('Getting labels...') for tag_set in tag_sets: indexed_tags = self.index_tags(tag_set) labels.append(to_categorical(np.asarray(indexed_tags), nb_classes=4)) labels = pad_sequences(labels, maxlen=200) return labels def compile(self, tokenizer, glove_dir='./data/', embedding_dim=200, dropout_fraction=0.2, hidden_dim=32): """Compile network model for NER Args: glove_file 
                (str): Directory containing the GloVe files
            embedding_dim (int): Size of embedding vector
            tokenizer (WordTokenizer): Object used to tokenize original texts
            dropout_fraction (float): Fraction of randomly zeroed weights in dropout layer (unused: this network has no dropout layer)
            hidden_dim (int): Hidden dimension
        """
        # Load embedding layer
        print('Loading GloVe embedding...')
        embeddings_index = {}
        # The GloVe text files are UTF-8 encoded
        with open(os.path.join(glove_dir, 'glove.6B.'+str(embedding_dim)+'d.txt'), 'r', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs
        print(('Found %s word vectors.' % len(embeddings_index)))

        # Create embedding layer
        print('Creating embedding layer...')
        embedding_matrix = np.zeros((len(tokenizer.tokenizer.word_index) + 1, embedding_dim))
        for word, i in list(tokenizer.tokenizer.word_index.items()):
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                # words not found in embedding index will be all-zeros.
                embedding_matrix[i] = embedding_vector

        # Create network
        print('Creating network...')
        self.model = Sequential()
        self.model.add(Embedding(len(tokenizer.tokenizer.word_index) + 1,
                                 embedding_dim,
                                 weights=[embedding_matrix],
                                 input_length=tokenizer.max_sequence_length,
                                 trainable=False,
                                 mask_zero=True))
        self.model.add(Bidirectional(LSTM(hidden_dim, return_sequences=True)))
        self.model.add(TimeDistributed(Dense(len(self.tag_map) + 1)))
        self.model.add(Activation('softmax'))

        # Compile model
        print('Compiling network...')
        self.model.compile(loss='categorical_crossentropy',
                           optimizer='adam',
                           metrics=['acc'])

    def train(self, data, labels, validation_split=0.2, batch_size=256, epochs=2):
        """Train NER model

        Args:
            data (np.array): 2D numpy array (n_samples, tokenizer.max_sequence_length)
            labels (np.array): 3D numpy array (n_samples, tokenizer.max_sequence_length, len(self.tag_map) + 1)
            validation_split (float): Fraction of samples to be used for validation
            batch_size (int): Training batch size
            epochs (int): Number of training epochs
        """
        print('Training...')

        # Split the data into a training set and a validation set
        indices = np.arange(data.shape[0])
        np.random.shuffle(indices)
        data = data[indices]
        labels = labels[indices]
        nb_validation_samples = int(validation_split * data.shape[0])
        x_train = data[:-nb_validation_samples]
        y_train = labels[:-nb_validation_samples]
        x_val = data[-nb_validation_samples:]
        y_val = labels[-nb_validation_samples:]
        print(data.shape, labels.shape)

        # Train!
        self.save()
        checkpointer = ModelCheckpoint(filepath=self.prefix+'.h5', verbose=1, save_best_only=False)
        # Keras 2 renamed the `nb_epoch` argument to `epochs`
        self.model.fit(x_train, y_train, validation_data=(x_val, y_val),
                       callbacks=[checkpointer],
                       epochs=epochs, batch_size=batch_size)
        self.evaluate(x_val, y_val, batch_size)

    def evaluate(self, x_test, y_test, batch_size=256):
        """Evaluate NER model

        Args:
            x_test (np.array): 2D numpy array (n_samples, tokenizer.max_sequence_length)
            y_test (np.array): 3D numpy array (n_samples, tokenizer.max_sequence_length, len(self.tag_map) + 1)
            batch_size (int): Prediction batch size
        """
        print('Evaluating...')
        predictions_last_epoch = self.model.predict(x_test, batch_size=batch_size, verbose=1)
        predicted_classes = np.argmax(predictions_last_epoch, axis=2).flatten()
        y_val = np.argmax(y_test, axis=2).flatten()
        target_names = ['']*(max(self.tag_map.values())+1)
        for category in self.tag_map:
            target_names[self.tag_map[category]] = category
        print((classification_report(y_val, predicted_classes, target_names=target_names, digits=6, labels=range(len(target_names)))))

================================================
FILE: tokenizer.py
================================================
"""Word tokenizer class"""

import os
import numpy as np
try:
    import cPickle as pickle
except ImportError:
    import pickle

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


class WordTokenizer(object):
    """Class which tokenizes words

    Attributes:
        max_sequence_length (int): Maximum sequence length for embedding
        tokenizer (Tokenizer):
            Keras Tokenizer
        prefix (str): Prefix for tokenizer save file
    """

    def __init__(self, max_sequence_length=200, prefix="./models/tokenizer"):
        """Create tokenizer

        Args:
            max_sequence_length (int): Maximum sequence length for texts
            prefix (str): Prefix for tokenizer save file
        """
        self.max_sequence_length = max_sequence_length
        self.prefix = prefix
        self.tokenizer = None

    def save(self, prefix=None):
        """Saves the tokenizer

        Args:
            prefix (str): Prefix for tokenizer save file
        """
        if prefix is not None:
            self.prefix = prefix
        # Use a context manager so the pickle file is flushed and closed
        with open(self.prefix + ".pickle", "wb") as f:
            pickle.dump(self.tokenizer, f)

    def load(self, prefix=None):
        """Loads the tokenizer

        Args:
            prefix (str): Prefix for tokenizer save file
        """
        if prefix is not None:
            self.prefix = prefix
        with open(self.prefix + ".pickle", "rb") as f:
            self.tokenizer = pickle.load(f)

    def train(self, texts, max_nb_words=80000):
        """Takes a list of texts, fits a tokenizer to them, and saves it.

        Args:
            texts (list(str)): List of texts
            max_nb_words (int): Maximum number of words indexed (take most frequently used)
        """
        # Tokenize
        print('Training tokenizer...')
        # Keras 2 renamed the `nb_words` argument to `num_words`
        self.tokenizer = Tokenizer(num_words=max_nb_words)
        self.tokenizer.fit_on_texts(texts)
        self.save()
        print(('Found %s unique tokens.' % len(self.tokenizer.word_index)))

    def tokenize(self, texts):
        """Takes a list of texts and tokenizes them.
        Args:
            texts (list(str)): List of texts
        Returns:
            np.array: 2D numpy array (len(texts), self.max_sequence_length)
        """
        sequences = self.tokenizer.texts_to_sequences(texts)
        data = pad_sequences(sequences, maxlen=self.max_sequence_length)
        return data

================================================
FILE: train_classifier.py
================================================
"""Script to train a product category classifier based on product titles and descriptions
"""

import csv
import sys

from classifier import ProductClassifier
from tokenizer import WordTokenizer

MAX_TEXTS = 1000000


def usage():
    print("""
USAGE: python train_classifier.py data_file.csv
FORMAT: "title","brand","description","categories"
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 2:
        usage()

    # Fetch data
    texts, categories = [], []
    with open(sys.argv[1], 'r') as f:
        reader = csv.DictReader(f, fieldnames=["title", "brand", "description", "categories"])
        count = 0
        for row in reader:
            count += 1
            text, category = row['description'], row['categories'].split(' / ')[0]
            texts.append(text)
            categories.append(category)
            if count >= MAX_TEXTS:
                break
    print(('Processed %s texts.'
           % len(texts)))

    # Tokenize texts
    tokenizer = WordTokenizer()
    tokenizer.load()
    data = tokenizer.tokenize(texts)

    # Get labels from classifier
    classifier = ProductClassifier()
    labels = classifier.get_labels(categories)

    # Compile classifier network and train
    classifier.compile(tokenizer)
    classifier.train(data, labels, epochs=2)

if __name__ == "__main__":
    main(sys.argv)

================================================
FILE: train_ner.py
================================================
"""Script to train a product named entity recognizer based on product titles
"""

import csv
import sys

from ner import ProductNER
from tokenizer import WordTokenizer

MAX_TEXTS = 1000000


def usage():
    print("""
USAGE: python train_ner.py data_file.csv
FORMAT: "title","brand","description","categories","tags"
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 2:
        usage()

    # Fetch data
    texts, tags = [], []
    with open(sys.argv[1], 'r') as f:
        reader = csv.DictReader(f, fieldnames=["title", "brand", "description", "categories", "tags"])
        count = 0
        for row in reader:
            count += 1
            text, tag_set = row['title'], row['tags'].split(' ')[:-1]
            texts.append(text)
            tags.append(tag_set)
            if count >= MAX_TEXTS:
                break
    print(('Processed %s texts.'
           % len(texts)))

    # Tokenize texts
    tokenizer = WordTokenizer()
    tokenizer.load()
    data = tokenizer.tokenize(texts)

    # Get labels from NER
    ner = ProductNER()
    labels = ner.get_labels(tags)

    # Compile NER network and train
    ner.compile(tokenizer)
    ner.train(data, labels, epochs=2)

if __name__ == "__main__":
    main(sys.argv)

================================================
FILE: train_tokenizer.py
================================================
"""Script to train a word tokenizer
"""

import csv
import sys

from tokenizer import WordTokenizer

MAX_TEXTS = 1000000


def usage():
    print("""
USAGE: python train_tokenizer.py data_file.csv
FORMAT: "title","brand","description","categories"
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 2:
        usage()

    # Fetch data
    texts, categories = [], []
    with open(sys.argv[1], 'r') as f:
        reader = csv.DictReader(f, fieldnames=["title", "brand", "description", "categories"])
        count = 0
        for row in reader:
            count += 1
            text, category = row['title'] + ' ' + row['description'], row['categories'].split(' / ')[0]
            texts.append(text)
            categories.append(category)
            if count >= MAX_TEXTS:
                break
    print(('Processed %s texts.' % len(texts)))

    # Tokenize texts
    tokenizer = WordTokenizer()
    tokenizer.train(texts)

if __name__ == "__main__":
    main(sys.argv)