Repository: etano/productner
Branch: master
Commit: 8c511964be69
Files: 16
Total size: 44.1 KB

Directory structure:
gitextract_hvlacfzd/

├── .gitignore
├── Pipfile
├── README.md
├── classifier.py
├── data/
│   ├── groups.py
│   ├── normalize.py
│   ├── parse.py
│   ├── supplement.py
│   ├── tag.py
│   └── trim.py
├── extract.py
├── ner.py
├── tokenizer.py
├── train_classifier.py
├── train_ner.py
└── train_tokenizer.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.swp
*.pyc
*.swo
*.swn
*.txt
*.csv
*.json
*.h5

.idea/*
*.zip
*.gz
.DS_Store
/models


================================================
FILE: Pipfile
================================================
[[source]]

url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"


[dev-packages]


[packages]

keras = "*"
scikit-learn = "*"
tensorflow = "*"
"h5py" = "*"


================================================
FILE: README.md
================================================
# Product categorization and named entity recognition

This repository is meant to automatically extract features from product titles and descriptions. Below we explain how to install and run the code, and describe the implemented algorithms. We also provide background information including the current state-of-the-art in both sequence classification and sequence tagging, and suggest possible improvements to the current implementation. Enjoy!

## Requirements

Use Python 3.7 and install dependencies via the following command (please use venv or conda):
```
pip install -r requirements.txt
```

## Usage

### Fetching data

#### Amazon product data

    cd ./data/
    wget http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz
    gzip -d metadata.json.gz

#### GloVe

    cd ./data/
    wget https://nlp.stanford.edu/data/glove.6B.zip
    unzip glove.6B.zip

### Preprocessing data

    cd ./data/
    python parse.py metadata.json
    python normalize.py products.csv
    python trim.py products.normalized.csv
    python supplement.py products.normalized.trimmed.csv
    python tag.py products.normalized.trimmed.supplemented.csv

### Training models

    mkdir -p ./models/
    python train_tokenizer.py data/products.normalized.trimmed.supplemented.tagged.csv
    python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv
    python train_ner.py data/products.normalized.trimmed.supplemented.tagged.csv

### Extract information

Infer on our sample dataset with your model by running the following:

    python extract.py ./models/ Product\ Dataset.csv

## Contents

- extract.py: Script to extract product category specific attributes based on product titles and descriptions
- train_tokenizer.py: Script to train a word tokenizer
- train_ner.py: Script to train a product named entity recognizer based on product titles
- train_classifier.py: Script to train a product category classifier based on product titles and descriptions
- tokenizer.py: Word tokenizer class
- ner.py: Named entity recognition class
- classifier.py: Product classifier class
- data/parse.py: Parses Amazon product metadata found at http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz
- data/normalize.py: Normalizes product data
- data/trim.py: Trims product data
- data/supplement.py: Supplements product data
- data/tag.py: Tags product data
- Product\ Dataset.csv: CSV file with product ids, names, and descriptions

## Algorithms

These are the methods used in this demonstrative implementation. For state-of-the-art extensions, we refer the reader to the references listed below.

- Tokenization: Built-in Keras tokenizer with 80,000 word maximum
- Embedding: Stanford GloVe (Wikipedia 2014 + Gigaword 5, 200 dimensions) with 200 sequence length maximum
- Sequence classification: 3 layer CNN with max pooling between the layers
- Sequence tagging: Bidirectional LSTM

For the sequence classification task, we extract product titles, descriptions, and categories from the Amazon product corpus. We then fit our CNN model to predict product category based on a combination of product title and description. On 800K samples with a batch size of 256, we achieve an overall f1 score of ~0.90 after 2 epochs.

For the sequence tagging task, we extract product titles and brands from the Amazon product corpus. We then fit our bidirectional LSTM model to label each word token in the product title to be either a brand or not. On 800K samples with a batch size of 256, we achieve an overall f1 score of ~0.85 after 2 epochs.
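
As a rough illustration of the labels involved, the sketch below tags each word of a title as `B-B` (first brand word), `I-B` (subsequent brand word), or `O` (other), which are the tag names written by `data/tag.py`. The function name `tag_title` is hypothetical, and the matching here is a simplified contiguous exact match rather than the repository's exact logic.

```python
def tag_title(title, brand):
    """Label each space-separated word of `title` as B-B, I-B, or O."""
    words = title.split(' ')
    brand_words = brand.split(' ')
    tags = ['O'] * len(words)
    n = len(brand_words)
    i = 0
    while i <= len(words) - n:
        if words[i:i + n] == brand_words:
            # First brand word begins the entity, the rest are "inside" it
            tags[i] = 'B-B'
            for j in range(i + 1, i + n):
                tags[j] = 'I-B'
            i += n
        else:
            i += 1
    return tags


print(tag_title('sony playstation 4 console', 'sony'))
# ['B-B', 'O', 'O', 'O']
print(tag_title('western digital 1tb hard drive', 'western digital'))
# ['B-B', 'I-B', 'O', 'O', 'O']
```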

For both models we use the GloVe embedding with 200 dimensions, though we note that a larger dimensional embedding might achieve superior performance. Additionally, we could be more careful in the data preprocessing to trim bad tokens (e.g. HTML remnants). Also for both models we use a dropout layer after embedding to combat overfitting the data.

## Background

### Problem definition

The problem of extracting features from unstructured textual data can be given different names depending on the circumstances and desired outcome. Generally, we can split tasks into two camps: sequence classification and sequence tagging.

In sequence classification, we take a text fragment (usually a sentence up to an entire document), and try to project it into a categorical space. This is considered a many-to-one classification in that we are taking a set of many features and producing a single output.

Sequence tagging, on the other hand, is often considered a many-to-many problem since you take in an entire sequence and attempt to apply a label to each element of the sequence. An example of sequence tagging is part of speech labeling, where one attempts to label the part of speech of each word in a sentence. Other methods that fall into this camp include chunking (breaking a sentence into relational components) and named entity recognition (extracting pre-specified features like geographic locations or proper names).

### Tokenization and embedding

An often important step in any natural language processing task is projecting from the character-based space that composes words and sentences to a numeric space on which computer models can operate.

The first step is simply to index unique tokens appearing in a dataset. There is some freedom on what is considered a token, i.e. it can be considered a specific group of words, a single word, or even individual characters. A popular choice is to simply create a word-based dictionary which maps unique space-separated character sequences to unique indices. Usually this is done after a normalization procedure where everything is lower-cased, made into ASCII, etc. This dictionary can then be sorted by frequency of occurrence in the dataset and truncated to a maximum size. After tokenization, your dataset is transformed into a set of indices where truncated words are typically replaced with a '0' index.
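
The indexing procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the Keras tokenizer used in this repository; the helper names `build_word_index` and `texts_to_indices` are hypothetical, and normalization is reduced to lower-casing.

```python
from collections import Counter


def build_word_index(texts, max_words):
    """Map the `max_words` most frequent words to indices 1..max_words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending frequency
    most_common = counts.most_common(max_words)
    # Index 0 is reserved for words truncated from the dictionary
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}


def texts_to_indices(texts, word_index):
    """Replace each word by its index, or by 0 if it was truncated."""
    return [[word_index.get(w, 0) for w in text.lower().split()] for text in texts]


texts = ['Red shoes', 'red leather shoes', 'blue hat']
word_index = build_word_index(texts, max_words=2)
print(word_index)                           # {'red': 1, 'shoes': 2}
print(texts_to_indices(texts, word_index))  # [[1, 2], [1, 0, 2], [0, 0]]
```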

Following tokenization, the indexed words are often projected into an embedding vector space. Currently popular embeddings include word2vec [\[1\]](#references) and GloVe [\[2\]](#references). Word2vec (as the name implies) is a word to vector space projector composed of a two-layer neural network. The network is trained in one of two ways: a continuous bag-of-words where the model attempts to predict the current word by using the surrounding words as context features, and continuous skip-grams where the model attempts to predict surrounding context words by looking at the current word. GloVe stands for "Global Vectors for Word Representation". Essentially it is a count-based, unsupervised learned embedding where a token co-occurrence matrix is constructed and factored.
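
To make the two word2vec training modes concrete, the sketch below builds the (input, target) pairs each mode would see for a short token sequence. It only shows how the training examples are formed, not the neural network itself; the function names and the window size of 1 are assumptions chosen for illustration.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens, window=1):
    """Skip-gram: the current word is the input, each context word is a target."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]


def cbow_pairs(tokens, window=1):
    """CBOW: the surrounding context words are the input, the current word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


tokens = ['red', 'leather', 'shoes']
print(skipgram_pairs(tokens))
# [('red', 'leather'), ('leather', 'red'), ('leather', 'shoes'), ('shoes', 'leather')]
print(cbow_pairs(tokens))
# [(['leather'], 'red'), (['red', 'shoes'], 'leather'), (['leather'], 'shoes')]
```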

Vector-space embedding methods generally provide substantial improvements over using basic dictionaries since they inject contextual knowledge from the language. Additionally, they allow a much more compact representation, while maintaining important correlations. For example, they allow you to do amazing things like performing word arithmetic:

    king - man + woman = queen

where equality is determined by directly computing vector overlaps.
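
A minimal sketch of this arithmetic, using cosine similarity as the overlap measure, is shown below. The two-dimensional vectors are made up for illustration (real GloVe vectors have 50 to 300 dimensions), and the `analogy` helper is hypothetical. As is conventional, the three input words are excluded from the candidates.

```python
import math


def cosine(u, v):
    """Cosine similarity: normalized overlap between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy embedding (assumed values): first axis ~ "royalty", second axis ~ "maleness"
vectors = {
    'king': [0.9, 0.8],
    'man': [0.1, 0.9],
    'woman': [0.1, 0.1],
    'queen': [0.9, 0.1],
    'apple': [0.2, 0.5],
}


def analogy(a, b, c, vectors):
    """Return the word whose vector best overlaps with a - b + c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy('king', 'man', 'woman', vectors))  # queen
```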

### Sequence classification

Classification is the shining pillar of modern-day machine learning with convolutional neural networks (CNN) at the top. With their ability to efficiently represent high-level features via windowed filtering, CNN's have seen their largest success in the classification and segmentation of images. However, more recently, CNN's have started seeing success in natural language sequence classification as well. Several recent works have shown that for text classification, CNN's can significantly outperform other classifying methods such as hidden Markov models and support vector machines [\[3,4\]](#references). The reason CNN's see success in text classification is likely the same reason they see success in the vision domain: there are strong, regular correlations between nearby features which are efficiently picked up by reasonably sized filters.

Even more recently CNN's dominance has been toppled by the recurrent neural network (RNN) architectures. In particular, long short-term memory (LSTM) units have shown exceptional promise. LSTM's pass output from one unit to the next, while carrying along an internal state. How this state updates (as well as other weights in the network) can be trained end-to-end on variable length sequences by passing a single token at a time. For classification, bidirectional LSTM's, which allow for long-range contextual correlations in both forward and reverse directions, have seen the best performance [\[5,6\]](#references). An additional feature of these networks is an attention layer that allows continuous addressing of internal states of the sequential LSTM units. This further strengthens the network's ability to draw correlations from both nearby and far away tokens.

### Sequence tagging

As mentioned above, sequence tagging is a many-to-many machine learning task, and thus places an added emphasis on the sequential nature of the input and output. This makes CNN's largely ill-suited for the problem. Instead the dominant approaches are again bidirectional LSTM's [\[11,12\]](#references) as well as another method called conditional random fields (CRF) [\[7\]](#references). CRF's can be seen as either sequential logistic regression or more powerful hidden Markov models. Essentially they are sequential models composed of many defined feature functions that depend both on the word currently being labelled as well as surrounding words. The relative weights of these feature functions can then be trained via any supervised learning approach. CRF's are used extensively in the literature for both part of speech tagging as well as named entity recognition because of their ease of use and intuitive feeling [\[8-10\]](#references).
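
The idea of weighted feature functions can be sketched as follows. This is a toy linear-chain scorer under stated assumptions: the feature functions, the hand-picked weights, and the small brand lexicon are all invented for illustration, whereas a real CRF learns its weights from labelled data. The best labelling is found here by brute force over all label sequences, which is exponential in the sentence length; practical CRF implementations use the Viterbi algorithm instead.

```python
from itertools import product

LABELS = ['B-B', 'I-B', 'O']
BRAND_LABELS = ('B-B', 'I-B')
KNOWN_BRAND_WORDS = {'western', 'digital'}  # assumed lexicon, illustration only


# Each feature function sees the words, the position, and the current/previous labels
def f_lexicon_word_is_brand(words, i, label, prev_label):
    return 1.0 if words[i] in KNOWN_BRAND_WORDS and label in BRAND_LABELS else 0.0


def f_unknown_word_is_other(words, i, label, prev_label):
    return 1.0 if words[i] not in KNOWN_BRAND_WORDS and label == 'O' else 0.0


def f_inside_without_begin(words, i, label, prev_label):
    return 1.0 if label == 'I-B' and prev_label not in BRAND_LABELS else 0.0


def f_begin_right_after_brand(words, i, label, prev_label):
    return 1.0 if label == 'B-B' and prev_label in BRAND_LABELS else 0.0


# (weight, feature) pairs; the weights are hand-picked rather than trained
WEIGHTED_FEATURES = [
    (2.0, f_lexicon_word_is_brand),
    (1.0, f_unknown_word_is_other),
    (-5.0, f_inside_without_begin),
    (-1.0, f_begin_right_after_brand),
]


def score(words, labels):
    """Sum the weighted feature values over every position in the sequence."""
    total = 0.0
    for i, label in enumerate(labels):
        prev_label = labels[i - 1] if i > 0 else None
        total += sum(w * f(words, i, label, prev_label) for w, f in WEIGHTED_FEATURES)
    return total


def best_labeling(words):
    """Brute-force search for the highest scoring label sequence."""
    return list(max(product(LABELS, repeat=len(words)),
                    key=lambda labels: score(words, labels)))


print(best_labeling(['western', 'digital', '1tb']))  # ['B-B', 'I-B', 'O']
```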

Even more recent models for sequence tagging use a combination of the aforementioned methods (CNN, LSTM, and CRF) [\[13,14,15\]](#references).  These works usually use a bidirectional LSTM as the major labeling architecture, another RNN or CNN to capture character-level information, and finally a CRF layer to model the label dependency. A logical next step will be to combine these methods with the neural attention models used in sequence classification, though this seems to be currently missing from the literature.

### Future directions

Looking forward, there are several available avenues for continued research. More sophisticated word embeddings might help alleviate the need for complicated neural architectures. Hierarchical optimization methods can be used to automatically build new architectures as well as optimize hyperparameters. Diverse models can be intelligently combined to produce more powerful classification schemes (indeed almost all Kaggle competitions are won this way). One interesting approach is to combine text data with other available data sources such as associated images [\[10\]](#references). By collecting data from different sources, feature labels could possibly be extracted automatically by cross-comparison.

## References

[1] "Distributed Representations of Words and Phrases and their Compositionality". Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. https://arxiv.org/abs/1310.4546.

[2] "GloVe: Global Vectors for Word Representation". Stanford NLP. 2015. https://nlp.stanford.edu/projects/glove/.

[3] "Convolutional Neural Networks for Sentence Classification". Yoon Kim. 2014. https://arxiv.org/abs/1408.5882.

[4] "Character-level Convolutional Networks for Text Classification". Xiang Zhang, Junbo Zhao, Yann LeCun. 2015. https://arxiv.org/abs/1509.01626.

[5] "Document Modeling with Gated Recurrent Neural Network for Sentiment Classification". Duyu Tang, Bing Qin, Ting Liu. 2015. http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP167.pdf.

[6] "Hierarchical Attention Networks for Document Classification". Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy. 2016. https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf.

[7] "An Introduction to Conditional Random Fields". Charles Sutton, Andrew McCallum. 2010. https://arxiv.org/abs/1011.4088.

[8] "Attribute Extraction from Product Titles in eCommerce". Ajinkya More. 2016. https://arxiv.org/abs/1608.04670.

[9] "Bootstrapped Named Entity Recognition for Product Attribute Extraction". Duangmanee (Pew) Putthividhya, Junling Hu. 2011. http://www.aclweb.org/anthology/D11-1144.

[10] "A Machine Learning Approach for Product Matching and Categorization". Petar Ristoski, Petar Petrovski, Peter Mika, Heiko Paulheim. 2017. http://www.semantic-web-journal.net/content/machine-learning-approach-product-matching-and-categorization-0.

[11] "Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss". Barbara Plank, Anders Søgaard, Yoav Goldberg. 2016. https://arxiv.org/abs/1604.05529.

[12] "Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network". Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao. 2015. https://arxiv.org/abs/1510.06168.

[13] "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF". Xuezhe Ma, Eduard Hovy. 2016. https://arxiv.org/abs/1603.01354.

[14] "Neural Architectures for Named Entity Recognition". Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. 2016. https://arxiv.org/abs/1603.01360.

[15] "Neural Models for Sequence Chunking". Feifei Zhai, Saloni Potdar, Bing Xiang, Bowen Zhou. 2017. https://arxiv.org/abs/1701.04027.


================================================
FILE: classifier.py
================================================
"""Product classifier class"""

import json
import os

import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Input, Flatten, Dropout, Conv1D, MaxPooling1D, Embedding
from keras.models import load_model, Model
from keras.utils import to_categorical
from sklearn.metrics import classification_report


class ProductClassifier(object):
    """Class which classifies products based on various inputs

       Attributes:
           prefix (str): Model files prefix
           model (keras.model): Keras model
           category_map (dict(str, int)): Map between category names and indices
    """

    def __init__(self, prefix=None):
        """Load in model and category map

        Args:
            prefix (str): Prefix of directory containing model HDF5 file and category map JSON file
        """
        if prefix is not None:
            self.load(prefix)
        else:
            self.prefix = 'models/classifier'
            self.model = None
            self.category_map = {}

    def load(self, prefix=None):
        """Load in model and category map

        Args:
            prefix (str): Prefix of directory containing model HDF5 file and category map JSON file
        """
        if prefix is not None: self.prefix = prefix
        self.model = load_model(self.prefix + '.h5')
        # Use a context manager so the JSON file handle is closed promptly
        with open(self.prefix + '.json', 'r') as f:
            self.category_map = json.load(f)

    def save(self, prefix=None):
        """Save model and category map

        Args:
            prefix (str): Prefix of directory containing model HDF5 file and category map JSON file
        """
        if prefix is not None: self.prefix = prefix
        self.model.save(self.prefix + '.h5')
        with open(self.prefix + '.json', 'w') as out:
            json.dump(self.category_map, out)

    def index_categories(self, categories):
        """Take a list of possibly duplicate categories and create an index list

        Args:
            categories (list(str)): List of categories
        Returns:
            list(int): List of indices
        """
        print('Indexing categories...')
        indices = []
        for category in categories:
            if not (category in self.category_map):
                self.category_map[category] = len(self.category_map)
            indices.append(self.category_map[category])
        print(('Found %s unique categories.' % len(self.category_map)))
        return indices

    def classify(self, data):
        """Classify products by text

        Args:
            data (np.array): 2D array representing descriptions of the product and/or product title
        Returns:
            list(dict(str, float)): List of dictionaries of product categories with associated confidence
        """
        prediction = self.model.predict(data)
        all_category_probs = []
        for i in range(prediction.shape[0]):
            category_probs = {}
            for category in self.category_map:
                category_probs[category] = prediction[i, self.category_map[category]]
            all_category_probs.append(category_probs)
        return all_category_probs

    def get_labels(self, categories):
        """Create labels from a list of categories

        Args:
            categories (list(str)): A list of product categories
        Returns:
            (list(int)): List of indices
        """
        indexed_categories = self.index_categories(categories)
        labels = to_categorical(np.asarray(indexed_categories))
        return labels

    def compile(self, tokenizer, glove_dir='./data/', embedding_dim=100, dropout_fraction=0.0, kernal_size=5,
                n_filters=128):
        """Compile network model for classifier

        Args:
            glove_dir (str): Directory containing the GloVe files
            embedding_dim (int): Size of embedding vector
            tokenizer (WordTokenizer): Object used to tokenize original texts
            dropout_fraction (float): Fraction of randomly zeroed weights in dropout layer
            kernal_size (int): Size of sliding window for convolution
            n_filters (int): Number of filters to produce from convolution
        """
        # Load embedding layer
        print('Loading GloVe embedding...')
        embeddings_index = {}
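        glove_path = os.path.join(glove_dir, 'glove.6B.' + str(embedding_dim) + 'd.txt')
        # Read the GloVe vectors as UTF-8 and close the file even if parsing fails
        with open(glove_path, 'r', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs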
        print(('Found %s word vectors.' % len(embeddings_index)))

        # Create embedding layer
        print('Creating embedding layer...')
        embedding_matrix = np.zeros((len(tokenizer.tokenizer.word_index) + 1, embedding_dim))
        for word, i in list(tokenizer.tokenizer.word_index.items()):
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                # words not found in embedding index will be all-zeros.
                embedding_matrix[i] = embedding_vector
        embedding_layer = Embedding(len(tokenizer.tokenizer.word_index) + 1,
                                    embedding_dim,
                                    weights=[embedding_matrix],
                                    input_length=tokenizer.max_sequence_length,
                                    trainable=False)

        # Create network
        print('Creating network...')
        sequence_input = Input(shape=(tokenizer.max_sequence_length,), dtype='int32')
        embedded_sequences = embedding_layer(sequence_input)
        x = Dropout(dropout_fraction)(embedded_sequences)
        x = Conv1D(n_filters, kernal_size, activation='relu')(x)
        x = MaxPooling1D(kernal_size)(x)
        x = Conv1D(n_filters, kernal_size, activation='relu')(x)
        x = MaxPooling1D(kernal_size)(x)
        x = Conv1D(n_filters, kernal_size, activation='relu')(x)
        x = MaxPooling1D(int(x.shape[1]))(x)  # global max pooling
        x = Flatten()(x)
        x = Dense(n_filters, activation='relu')(x)
        preds = Dense(len(self.category_map), activation='softmax')(x)

        # Compile model
        print('Compiling network...')
        self.model = Model(sequence_input, preds)
        self.model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop',
                           metrics=['acc'])

    def train(self, data, labels, validation_split=0.2, batch_size=256, epochs=2):
        """Train classifier

        Args:
            data (np.array): 2D numpy array of token indices (n_samples, tokenizer.max_sequence_length)
            labels (np.array): 2D numpy array (n_samples, len(self.category_map))
            validation_split (float): Fraction of samples to be used for validation
            batch_size (int): Training batch size
            epochs (int): Number of training epochs
        """
        print('Training...')
        # Split the data into a training set and a validation set
        indices = np.arange(data.shape[0])
        np.random.shuffle(indices)
        data = data[indices]
        labels = labels[indices]
        nb_validation_samples = int(validation_split * data.shape[0])

        x_train = data[:-nb_validation_samples]
        y_train = labels[:-nb_validation_samples]
        x_val = data[-nb_validation_samples:]
        y_val = labels[-nb_validation_samples:]

        # Train!
        self.save()
        checkpointer = ModelCheckpoint(filepath=self.prefix + '.h5', verbose=1, save_best_only=False)
        self.model.fit(x_train, y_train, validation_data=(x_val, y_val),
                       callbacks=[checkpointer],
                       epochs=epochs, batch_size=batch_size)  # 'nb_epoch' was the Keras 1 name
        self.evaluate(x_val, y_val, batch_size)

    def evaluate(self, x_test, y_test, batch_size=256):
        """Evaluate classifier

        Args:
            x_test (np.array): 2D numpy array of token indices (n_samples, tokenizer.max_sequence_length)
            y_test (np.array): 2D numpy array (n_samples, len(self.category_map))
            batch_size (int): Prediction batch size
        """
        print('Evaluating...')
        predictions_last_epoch = self.model.predict(x_test, batch_size=batch_size, verbose=1)
        predicted_classes = np.argmax(predictions_last_epoch, axis=1)
        target_names = [''] * len(self.category_map)
        for category in self.category_map:
            target_names[self.category_map[category]] = category
        y_val = np.argmax(y_test, axis=1)
        print((classification_report(y_val, predicted_classes, target_names=target_names, digits=6,
                                     labels=range(len(self.category_map)))))


================================================
FILE: data/groups.py
================================================
import csv
import sys
from operator import itemgetter

with open(sys.argv[1], 'r') as f:
    reader = csv.reader(f)
    brands, categories = {}, {}
    count = 0
    for row in reader:
        count += 1
        if not (count % 10000): print(count)
        brand = row[1]
        if brand in brands:
            brands[brand] += 1
        else:
            brands[brand] = 1
        category = row[3].split(' / ')[0]
        if category in categories:
            categories[category] += 1
        else:
            categories[category] = 1
    print(sorted(brands.items(), key=itemgetter(1)))
    print(sorted(categories.items(), key=itemgetter(1)))


================================================
FILE: data/normalize.py
================================================
"""Normalizes product data"""

import csv
import sys


def unescape(s):
    if sys.version_info >= (3, 0):
        import html
        output = html.unescape(str(s))
    else:
        import htmllib

        p = htmllib.HTMLParser(None)
        p.save_bgn()
        try:
            p.feed(s)
        except Exception:
            return s
        output = p.save_end()
    return output


in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['normalized'] + ['csv'])
with open(in_file, 'r') as f, open(out_file, 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)  # output file is closed (and flushed) with the block
    count = 0
    for row in reader:
        count += 1
        if not (count % 10000):
            print(count, 'rows normalized')
        row = [unescape(x).lower().replace('\\n', ' ') for x in row]
        writer.writerow(row)
    print(count, 'rows normalized')


================================================
FILE: data/parse.py
================================================
"""Parses Amazon product metadata found at http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz"""

import csv, sys, yaml
from yaml import CLoader as Loader


def usage():
    print("""
USAGE: python parse.py metadata.json
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 2:
        usage()
    filename = sys.argv[1]
    with open(filename, 'r') as f, open("products.csv", "w") as out_file:
        count, good, bad = 0, 0, 0
        out = csv.writer(out_file)  # output file is closed (and flushed) with the block
        for line in f:
            count += 1
            if not (count % 100000):
                print("count:", count, "good:", good, ", bad:", bad)
            if ("'title':" in line) and ("'brand':" in line) and ("'categories':" in line):
                try:
                    line = line.rstrip().replace("\\'", "''")
                    product = yaml.load(line, Loader=Loader)
                    title, brand, categories = product['title'], product['brand'], product['categories']
                    description = product['description'] if 'description' in product else ''
                    categories = ' / '.join([item for sublist in categories for item in sublist])
                    out.writerow([title, brand, description, categories])
                    good += 1
                except Exception as e:
                    print(line)
                    print(e)
                    bad += 1
        print("good:", good, ", bad:", bad)


if __name__ == "__main__":
    main(sys.argv)


================================================
FILE: data/supplement.py
================================================
"""Supplements product data"""

import csv
import sys

in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['supplemented'] + ['csv'])
with open(in_file, 'r') as f, open(out_file, 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)  # output file is closed (and flushed) with the block
    count, supplemented = 0, 0
    for row in reader:
        count += 1
        if not (count % 10000):
            print(supplemented, '/', count, 'rows supplemented')
        title, brand, description = row[0], row[1], row[2]
        if not (brand in title):
            supplemented += 1
            title = brand + ' ' + title
        description = title + ' ' + description
        row[0], row[1], row[2] = title, brand, description
        writer.writerow(row)
    print(supplemented, '/', count, 'rows supplemented')


================================================
FILE: data/tag.py
================================================
"""Tags product data"""

import csv
import sys

in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['tagged'] + ['csv'])
with open(in_file, 'r') as f, open(out_file, 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)  # output file is closed (and flushed) with the block
    count = 0
    for row in reader:
        count += 1
        if not (count % 10000):
            print(count, 'rows tagged')
        title, brand, description = row[0], row[1], row[2]
        tagging = ''
        brand = brand.split(' ')
        brand_started = False
        for word in title.split(' '):
            if word == brand[0]:
                tagging += 'B-B '
                brand_started = True
            elif brand_started and word in brand[1:]:
                # Emit exactly one tag per title word, even for brands of 3+ words
                tagging += 'I-B '
            else:
                brand_started = False
                tagging += 'O '
        row.append(tagging)
        writer.writerow(row)
    print(count, 'rows tagged')


================================================
FILE: data/trim.py
================================================
"""Trims product data"""

import csv
import sys

in_file = sys.argv[1]
out_file = '.'.join(in_file.split('.')[:-1] + ['trimmed'] + ['csv'])
with open(in_file, 'r') as f, open(out_file, 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)  # output file is closed (and flushed) with the block
    count, trimmed = 0, 0
    for row in reader:
        try:
            count += 1
            if not (count % 10000):
                print(trimmed, '/', count, 'rows trimmed')
            brand = row[1].lower()
            if brand == 'unknown' or brand == '' or brand == 'generic':
                trimmed += 1
                continue
            writer.writerow(row)
        except Exception:
            print(row)
    print(trimmed, '/', count, 'rows trimmed')


================================================
FILE: extract.py
================================================
"""Script to extract product category specific attributes based on product titles and descriptions
"""

import csv
import os
import sys
from operator import itemgetter

from classifier import ProductClassifier
from ner import ProductNER
from tokenizer import WordTokenizer


def process(row, tokenizer, classifier, ner):
    """Run a row through processing pipeline

    tokenize -> classify
             -> extract attributes

    Args:
        row (dict(str: str)): Dictionary of field name/field value pairs
        tokenizer (WordTokenizer): Word tokenizer
        classifier (ProductClassifier): Product classifier
        ner (ProductNER): Named entity recognizer
    Returns:
        dict(str: str): The input row with 'category' and 'brand' fields added
    """
    # Classify
    data = tokenizer.tokenize([row['name'] + ' ' + row['description']])
    categories = classifier.classify(data)[0]
    row['category'] = max(list(categories.items()), key=itemgetter(1))[0]

    # Extract entities
    data = tokenizer.tokenize([row['name']])
    tags = ner.tag(data)[0]
    brand, brand_started = '', False
    for word, tag in zip(row['name'].split(' '), tags):
        max_tag = max(list(tag.items()), key=itemgetter(1))[0]
        if 'B-B' in max_tag and (not brand_started):
            brand = word
            brand_started = True
        elif 'I-B' in max_tag and brand_started:
            brand += ' ' + word
        else:
            brand_started = False
    row['brand'] = brand

    return row


def usage():
    print("""
USAGE: python extract.py model_dir data_file.csv
FORMAT: "id","name","description","price"
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 3:
        usage()
    model_dir = sys.argv[1]
    data_file = sys.argv[2]

    # Load tokenizer
    tokenizer = WordTokenizer()
    tokenizer.load(os.path.join(model_dir, 'tokenizer'))

    # Load classifier
    classifier = ProductClassifier()
    classifier.load(os.path.join(model_dir, 'classifier'))

    # Load named entity recognizer
    ner = ProductNER()
    ner.load(os.path.join(model_dir, 'ner'))

    # newline='' lets the csv module handle line endings itself
    with open(data_file, 'r', encoding="iso-8859-1", newline='') as f:
        reader = csv.DictReader(f)
        with open('.'.join(data_file.split('.')[:-1] + ['processed', 'csv']), 'w', encoding="utf-8", newline='') as outfile:
            writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames + ['category', 'brand'])
            writer.writeheader()
            count = 0
            for row in reader:
                count += 1
                processed_row = process(row, tokenizer, classifier, ner)
                print(processed_row)
                writer.writerow(processed_row)


if __name__ == "__main__":
    main(sys.argv)


================================================
FILE: ner.py
================================================
"""Named entity recognition class"""

import json
import os

import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Embedding, LSTM, Bidirectional, TimeDistributed, Activation
from keras.models import load_model, Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from sklearn.metrics import classification_report


class ProductNER(object):
    """Class which recognizes named entities

       Attributes:
           prefix (str): Model files prefix
           model (keras.model): Keras model
           tag_map (dict(str, int)): Map between tag names and indices
    """

    def __init__(self, prefix=None):
        """Load in model and tag map if a prefix is given

        Args:
            prefix (str): Path prefix of the model HDF5 file (prefix + '.h5') and tag map JSON file (prefix + '.json')
        """
        if prefix is not None:
            self.load(prefix)
        else:
            self.prefix = 'models/ner'
            self.model = None
            self.tag_map = {}

    def load(self, prefix=None):
        """Load in model and tag map

        Args:
            prefix (str): Path prefix of the model HDF5 file (prefix + '.h5') and tag map JSON file (prefix + '.json')
        """
        if prefix is not None: self.prefix = prefix
        self.model = load_model(self.prefix+'.h5')
        with open(self.prefix+'.json', 'r') as f:
            self.tag_map = json.load(f)

    def save(self, prefix=None):
        """Save model and tag map

        Args:
            prefix (str): Path prefix of the model HDF5 file (prefix + '.h5') and tag map JSON file (prefix + '.json')
        """
        if prefix is not None: self.prefix = prefix
        self.model.save(self.prefix+'.h5')
        with open(self.prefix+'.json', 'w') as out:
            json.dump(self.tag_map, out)

    def tag(self, data):
        """Return tag probabilities for every word of some tokenized texts

        Args:
            data (np.array): 2D array (n_samples, tokenizer.max_sequence_length) of left-padded word indices
        Returns:
            list(list(dict(str, float))): For each text, one dict per word mapping tag name to probability
        """
        prediction = self.model.predict(data)
        all_tag_probs = []
        for i in range(prediction.shape[0]):
            sentence_tag_probs = []
            first_word = 0
            for j in range(data[i].shape[0]):
                if data[i,j] != 0: break
                first_word += 1
            for j in range(first_word, prediction.shape[1]):
                word_tag_probs = {}
                for tag in self.tag_map:
                    word_tag_probs[tag] = prediction[i,j,self.tag_map[tag]]
                sentence_tag_probs.append(word_tag_probs)
            all_tag_probs.append(sentence_tag_probs)
        return all_tag_probs

    def index_tags(self, tags):
        """Take a list of possibly duplicate tags and create an index list

        Args:
            tags (list(str)): List of tags
        Returns:
            list(int): List of indices
        """
        indices = []
        for tag in tags:
            if not (tag in self.tag_map):
                self.tag_map[tag] = len(self.tag_map) + 1
            indices.append(self.tag_map[tag])
        return indices

    def get_labels(self, tag_sets):
        """Create labels from a list of tag_sets

        Args:
            tag_sets (list(list(str))): A list of word tag sets
        Returns:
            np.array: 3D array (n_samples, 200, 4) of left-padded one-hot tag labels
        """
        labels = []
        print('Getting labels...')
        for tag_set in tag_sets:
            indexed_tags = self.index_tags(tag_set)
            # 4 classes: index 0 (padding) plus tag indices 1-3, so at most three distinct tags are supported
            labels.append(to_categorical(np.asarray(indexed_tags), num_classes=4))
        labels = pad_sequences(labels, maxlen=200)
        return labels

    def compile(self, tokenizer, glove_dir='./data/', embedding_dim=200, dropout_fraction=0.2, hidden_dim=32):
        """Compile network model for NER

        Args:
            tokenizer (WordTokenizer): Object used to tokenize original texts
            glove_dir (str): Directory containing the GloVe files
            embedding_dim (int): Size of embedding vector
            dropout_fraction (float): Fraction of randomly zeroed weights in dropout layer (currently unused)
            hidden_dim (int): Hidden dimension
        """
        # Load embedding layer
        print('Loading GloVe embedding...')
        embeddings_index = {}
        with open(os.path.join(glove_dir, 'glove.6B.'+str(embedding_dim)+'d.txt'), 'r', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs
        print(('Found %s word vectors.' % len(embeddings_index)))

        # Create embedding layer
        print('Creating embedding layer...')
        embedding_matrix = np.zeros((len(tokenizer.tokenizer.word_index) + 1, embedding_dim))
        for word, i in list(tokenizer.tokenizer.word_index.items()):
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                # words not found in embedding index will be all-zeros.
                embedding_matrix[i] = embedding_vector

        # Create network
        print('Creating network...')
        self.model = Sequential()
        self.model.add(Embedding(len(tokenizer.tokenizer.word_index) + 1,
                                 embedding_dim,
                                 weights=[embedding_matrix],
                                 input_length=tokenizer.max_sequence_length,
                                 trainable=False,
                                 mask_zero=True))
        self.model.add(Bidirectional(LSTM(hidden_dim, return_sequences=True)))
        self.model.add(TimeDistributed(Dense(len(self.tag_map) + 1)))
        self.model.add(Activation('softmax'))

        # Compile model
        print('Compiling network...')
        self.model.compile(loss='categorical_crossentropy',
                           optimizer='adam',
                           metrics=['acc'])

    def train(self, data, labels, validation_split=0.2, batch_size=256, epochs=2):
        """Train NER model

        Args:
            data (np.array): 2D numpy array (n_samples, tokenizer.max_sequence_length)
            labels (np.array): 3D numpy array (n_samples, tokenizer.max_sequence_length, len(self.tag_map) + 1)
            validation_split (float): Fraction of samples to be used for validation
            batch_size (int): Training batch size
            epochs (int): Number of training epochs
        """
        print('Training...')
        # Split the data into a training set and a validation set
        indices = np.arange(data.shape[0])
        np.random.shuffle(indices)
        data = data[indices]
        labels = labels[indices]
        nb_validation_samples = int(validation_split * data.shape[0])

        x_train = data[:-nb_validation_samples]
        y_train = labels[:-nb_validation_samples]
        x_val = data[-nb_validation_samples:]
        y_val = labels[-nb_validation_samples:]

        print(data.shape, labels.shape)

        # Train!
        self.save()
        checkpointer = ModelCheckpoint(filepath=self.prefix+'.h5', verbose=1, save_best_only=False)
        self.model.fit(x_train, y_train, validation_data=(x_val, y_val),
                       callbacks=[checkpointer],
                       epochs=epochs, batch_size=batch_size)
        self.evaluate(x_val, y_val, batch_size)

    def evaluate(self, x_test, y_test, batch_size=256):
        """Evaluate NER model

        Args:
            x_test (np.array): 2D numpy array (n_samples, tokenizer.max_sequence_length)
            y_test (np.array): 3D numpy array (n_samples, tokenizer.max_sequence_length, len(self.tag_map) + 1)
            batch_size (int): Prediction batch size
        """
        print('Evaluating...')
        predictions_last_epoch = self.model.predict(x_test, batch_size=batch_size, verbose=1)
        predicted_classes = np.argmax(predictions_last_epoch, axis=2).flatten()
        y_val = np.argmax(y_test, axis=2).flatten()
        target_names = ['']*(max(self.tag_map.values())+1)
        for category in self.tag_map:
            target_names[self.tag_map[category]] = category

        print((classification_report(y_val, predicted_classes, target_names=target_names, digits = 6, labels=range(len(target_names)))))
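

# Illustrative sketch (NumPy only, not used by ProductNER): the label layout that
# get_labels() builds for a single title, i.e. one-hot rows for the tag indices,
# left-padded with all-zero rows. It assumes the Keras defaults of padding and
# truncating at the front; the name `one_hot_left_padded` is hypothetical.
import numpy as np


def one_hot_left_padded(indices, num_classes=4, maxlen=200):
    """Return a (maxlen, num_classes) array of left-padded one-hot labels.

    Args:
        indices (list(int)): Tag indices for one title, each in [1, num_classes)
        num_classes (int): Number of classes, including the padding index 0
        maxlen (int): Padded sequence length
    """
    labels = np.zeros((maxlen, num_classes), dtype='int32')
    # Keep only the last maxlen indices, as front truncation would
    indices = list(indices)[-maxlen:]
    offset = maxlen - len(indices)
    for position, index in enumerate(indices):
        labels[offset + position, index] = 1
    return labels

# Usage: one_hot_left_padded([1, 2, 3], num_classes=4, maxlen=5) has two all-zero
# padding rows followed by one-hot rows for the indices 1, 2 and 3.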


================================================
FILE: tokenizer.py
================================================
"""Word tokenizer class"""

import os
import numpy as np

try:
    import cPickle as pickle
except ImportError:
    import pickle
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


class WordTokenizer(object):
    """Class which tokenizes words

    Attributes:
        max_sequence_length (int): Maximum sequence length for embedding
        tokenizer (Tokenizer): Keras Tokenizer
        prefix (str): Prefix for tokenizer save file
    """

    def __init__(self, max_sequence_length=200, prefix="./models/tokenizer"):
        """Create tokenizer

        Args:
            max_sequence_length (int): Maximum sequence length for texts
            prefix (str): Prefix for tokenizer save file
        """
        self.max_sequence_length = max_sequence_length
        self.prefix = prefix
        self.tokenizer = None

    def save(self, prefix=None):
        """Saves the tokenizer

        Args:
            prefix (str): Prefix for tokenizer save file
        """
        if prefix is not None: self.prefix = prefix
        with open(self.prefix + ".pickle", "wb") as f:
            pickle.dump(self.tokenizer, f)

    def load(self, prefix=None):
        """Loads the tokenizer

        Args:
            prefix (str): Prefix for tokenizer save file
        """
        if prefix is not None: self.prefix = prefix
        with open(self.prefix + ".pickle", "rb") as f:
            self.tokenizer = pickle.load(f)

    def train(self, texts, max_nb_words=80000):
        """Takes a list of texts, fits a tokenizer to them, and saves it.

        Args:
            texts (list(str)): List of texts
            max_nb_words (int): Maximum number of words indexed (take most frequently used)
        """
        # Tokenize
        print('Training tokenizer...')
        self.tokenizer = Tokenizer(num_words=max_nb_words)
        self.tokenizer.fit_on_texts(texts)
        self.save()
        print(('Found %s unique tokens.' % len(self.tokenizer.word_index)))

    def tokenize(self, texts):
        """Takes a list of texts and tokenizes them.

        Args:
            texts (list(str)): List of texts
        Returns:
            np.array: 2D numpy array (len(texts), self.max_sequence_length)
        """
        sequences = self.tokenizer.texts_to_sequences(texts)
        data = pad_sequences(sequences, maxlen=self.max_sequence_length)
        return data
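

# Illustrative sketch (pure Python, not used by WordTokenizer): roughly what
# train() and tokenize() do, assuming words are ranked by frequency starting at
# index 1 and sequences are left-padded with 0. It only lowercases and splits on
# whitespace, so it is simpler than the Keras Tokenizer; both function names are
# hypothetical.
from collections import Counter


def build_word_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_left_padded(texts, word_index, maxlen):
    """Convert texts to lists of word indices, left-padded with 0 to maxlen."""
    rows = []
    for text in texts:
        # Unknown words are dropped; only the last maxlen indices are kept
        indices = [word_index[w] for w in text.lower().split() if w in word_index]
        indices = indices[-maxlen:]
        rows.append([0] * (maxlen - len(indices)) + indices)
    return rows

# Usage: build the index from training texts, then pass it together with new
# texts and a maximum length to texts_to_left_padded.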


================================================
FILE: train_classifier.py
================================================
"""Script to train a product category classifier based on product titles and descriptions
"""

import csv
import sys

from classifier import ProductClassifier
from tokenizer import WordTokenizer

MAX_TEXTS = 1000000


def usage():
    print("""
USAGE: python train_classifier.py data_file.csv
FORMAT: "title","brand","description","categories"
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 2:
        usage()

    # Fetch data
    texts, categories = [], []
    with open(argv[1], 'r') as f:
        reader = csv.DictReader(f, fieldnames=["title", "brand", "description", "categories"])
        count = 0
        for row in reader:
            count += 1
            text, category = row['description'], row['categories'].split(' / ')[0]
            texts.append(text)
            categories.append(category)
            if count >= MAX_TEXTS:
                break
    print(('Processed %s texts.' % len(texts)))

    # Tokenize texts
    tokenizer = WordTokenizer()
    tokenizer.load()
    data = tokenizer.tokenize(texts)

    # Get labels from classifier
    classifier = ProductClassifier()
    labels = classifier.get_labels(categories)

    # Compile classifier network and train
    classifier.compile(tokenizer)
    classifier.train(data, labels, epochs=2)


if __name__ == "__main__":
    main(sys.argv)


================================================
FILE: train_ner.py
================================================
"""Script to train a product named entity recognizer (NER) based on product titles
"""

import csv
import sys

from ner import ProductNER
from tokenizer import WordTokenizer

MAX_TEXTS = 1000000


def usage():
    print("""
USAGE: python train_ner.py data_file.csv
FORMAT: "title","brand","description","categories","tags"
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 2:
        usage()

    # Fetch data
    texts, tags = [], []
    with open(argv[1], 'r') as f:
        reader = csv.DictReader(f, fieldnames=["title", "brand", "description", "categories", "tags"])
        count = 0
        for row in reader:
            count += 1
            text, tag_set = row['title'], row['tags'].split(' ')[:-1]
            texts.append(text)
            tags.append(tag_set)
            if count >= MAX_TEXTS:
                break
    print(('Processed %s texts.' % len(texts)))

    # Tokenize texts
    tokenizer = WordTokenizer()
    tokenizer.load()
    data = tokenizer.tokenize(texts)

    # Get labels from NER
    ner = ProductNER()
    labels = ner.get_labels(tags)

    # Compile NER network and train
    ner.compile(tokenizer)
    ner.train(data, labels, epochs=2)


if __name__ == "__main__":
    main(sys.argv)


================================================
FILE: train_tokenizer.py
================================================
"""Script to train a word tokenizer
"""

import csv
import sys

from tokenizer import WordTokenizer

MAX_TEXTS = 1000000


def usage():
    print("""
USAGE: python train_tokenizer.py data_file.csv
FORMAT: "title","brand","description","categories"
""")
    sys.exit(0)


def main(argv):
    if len(argv) < 2:
        usage()

    # Fetch data
    texts, categories = [], []
    with open(argv[1], 'r') as f:
        reader = csv.DictReader(f, fieldnames=["title", "brand", "description", "categories"])
        count = 0
        for row in reader:
            count += 1
            text, category = row['title'] + ' ' + row['description'], row['categories'].split(' / ')[0]
            texts.append(text)
            categories.append(category)
            if count >= MAX_TEXTS:
                break
    print(('Processed %s texts.' % len(texts)))

    # Tokenize texts
    tokenizer = WordTokenizer()
    tokenizer.train(texts)


if __name__ == "__main__":
    main(sys.argv)