Repository: abhishekkrthakur/clickbaits_revisited
Branch: master
Commit: de9020635c58
Files: 14
Total size: 30.7 KB

Directory structure:
gitextract_ut2cqja5/
├── .gitignore
├── LICENSE
├── README.md
├── data/
│   └── .keep
├── data_processing/
│   ├── create_data.py
│   ├── data_cleaning.py
│   ├── feature_generation.py
│   ├── html_scraper.py
│   └── merge_data.py
└── deepnets/
    ├── LSTM_Title_Content.py
    ├── LSTM_Title_Content_Numerical_with_GloVe.py
    ├── LSTM_Title_Content_with_GloVe.py
    ├── LSTM_Titles.py
    └── TDD_Title_Content_with_GloVe.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
.idea/**
*.pyc
data/*.pkl
data/*.csv
data/*.zip

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2017 Abhishek Thakur

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
# Clickbaits Revisited

This repository provides the code used for: https://www.linkedin.com/pulse/clickbaits-revisited-deep-learning-title-content-features-thakur

### Data Collection

To run the code you must first collect the data:

- Get the facebook page parser from: https://github.com/minimaxir/facebook-page-post-scraper
- Run the python script get_fb_posts_fb_page.py for buzzfeed, upworthy, cnn, nytimes, wikinews, clickhole and StopClickBaitOfficial
- Save all the CSVs obtained from the above step in data/

### Data Pre-Processing

After the data has been collected, you need to run the following files to obtain training and test data. The order is important!

- $ cd data_processing
- $ python create_data.py
- $ python html_scraper.py
- $ python feature_generation.py
- $ python merge_data.py
- $ python data_cleaning.py

After the steps above, you will end up with train.csv and test.csv in data/

Please note that the above steps will require a lot of memory. So, if you have anything less than 64GB, please modify the code according to your needs.
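One way to trade speed for memory (a sketch only, not part of the original pipeline; the `fetch` helper, the chunk size and the part-file naming are illustrative) is to scrape and pickle the pages in smaller chunks instead of holding every HTML document in one list:

    import cPickle
    import pandas as pd
    import requests
    from joblib import Parallel, delayed
    from tqdm import tqdm


    def fetch(url):
        # illustrative fetcher with a timeout; the repo's html_scraper.py has no timeout
        try:
            return requests.get(url, timeout=10).text
        except requests.RequestException:
            return "no html"


    urls = pd.read_csv('../data/clickbaits.csv').status_link.values
    CHUNK = 5000
    for i in range(0, len(urls), CHUNK):
        htmls = Parallel(n_jobs=20)(delayed(fetch)(u) for u in tqdm(urls[i:i + CHUNK]))
        cPickle.dump(htmls, open('../data/clickbait_html_part_%d.pkl' % (i // CHUNK), 'wb'), -1)

The downstream scripts would then have to iterate over the part files instead of loading a single big pickle.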
### GloVe embeddings

Obtain GloVe embeddings from the following URL: http://nlp.stanford.edu/data/glove.840B.300d.zip

Extract the zip and place glove.840B.300d.txt in data/

### Deepnets

After all the above steps, you are ready to play around with the deep neural networks to classify clickbaits.

Change directory to deepnets/

- $ cd deepnets/

The deepnets are as follows:

- LSTM_Titles.py : LSTM on title text without GloVe embeddings
- LSTM_Title_Content.py : LSTM on title text and content text without GloVe embeddings
- LSTM_Title_Content_with_GloVe.py : LSTM on title and content text with GloVe embeddings
- TDD_Title_Content_with_GloVe.py : Time distributed dense on title and content text with GloVe embeddings
- LSTM_Title_Content_Numerical_with_GloVe.py : LSTM on title + content text with GloVe embeddings & dense net for numerical features

### Performance

The network with LSTM on title and content text with GloVe embeddings plus numerical features achieves an accuracy of 0.996 during validation and 0.992 on the test set.

All models were trained on an NVIDIA TitanX, Ubuntu 16.04 system with 64GB memory.

================================================
FILE: data/.keep
================================================

================================================
FILE: data_processing/create_data.py
================================================
# coding: utf-8
"""
Create usable data after scraping public facebook pages
@author: Abhishek Thakur
"""

import pandas as pd

buzzfeed = pd.read_csv('../data/buzzfeed_facebook_statuses.csv',
                       usecols=['link_name', 'status_type', 'status_link'])
clickhole = pd.read_csv('../data/clickhole_facebook_statuses.csv',
                        usecols=['link_name', 'status_type', 'status_link'])
cnn = pd.read_csv('../data/cnn_facebook_statuses.csv',
                  usecols=['link_name', 'status_type', 'status_link'])
nytimes = pd.read_csv('../data/nytimes_facebook_statuses.csv',
                      usecols=['link_name', 'status_type', 'status_link'])
stopclickbait = pd.read_csv('../data/StopClickBaitOfficial_facebook_statuses.csv',
                            usecols=['link_name', 'status_type', 'status_link'])
upworthy = pd.read_csv('../data/Upworthy_facebook_statuses.csv',
                       usecols=['link_name', 'status_type', 'status_link'])
wikinews = pd.read_csv('../data/wikinews_facebook_statuses.csv',
                       usecols=['link_name', 'status_type', 'status_link'])

wikinews.link_name = wikinews.link_name.apply(lambda x: str(x).replace(' - Wikinews, the free news source', ''))

# keep only link-type statuses
buzzfeed = buzzfeed[buzzfeed.status_type == 'link']
clickhole = clickhole[clickhole.status_type == 'link']
cnn = cnn[cnn.status_type == 'link']
nytimes = nytimes[nytimes.status_type == 'link']
stopclickbait = stopclickbait[stopclickbait.status_type == 'link']
upworthy = upworthy[upworthy.status_type == 'link']
wikinews = wikinews[wikinews.status_type == 'link']

# shuffle and subsample the two largest news pages
cnn = cnn.sample(frac=1).head(10000)
nytimes = nytimes.sample(frac=1).head(13000)

clickbaits = pd.concat([buzzfeed, clickhole, stopclickbait, upworthy])
non_clickbaits = pd.concat([cnn, nytimes, wikinews])

clickbaits.to_csv('../data/clickbaits.csv', index=False)
non_clickbaits.to_csv('../data/non_clickbaits.csv', index=False)
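Since create_data.py shuffles and subsamples the cnn and nytimes statuses, a quick sanity check of the resulting class sizes can be useful before scraping; a minimal sketch, not part of the repo:

import pandas as pd

clickbaits = pd.read_csv('../data/clickbaits.csv')
non_clickbaits = pd.read_csv('../data/non_clickbaits.csv')
print('clickbait links: %d, non-clickbait links: %d' % (len(clickbaits), len(non_clickbaits)))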
================================================
FILE: data_processing/data_cleaning.py
================================================
# coding: utf-8
"""
Cleans the data more and separates into training and test sets
@author: Abhishek Thakur
"""

import pandas as pd
from sklearn.cross_validation import train_test_split

internet_stop_words = ['site', 'navigation', 'new', 'times', 'york', 'information', 'index', 'like', 'related',
                       'search', 'follow', 'subscribe', 'subscribed', 'subscribing', 'spam', 'twitter',
                       'pinterest', 'facebook', 'google', 'privacy', 'policy', 'feedback', 'tweet', 'tweets',
                       'disclaimer', 'buzzfeed', 'clickhole', 'upworthy', 'cnn', 'nytimes', 'wikinews',
                       'instagram', 'newsletter', 'copyright', 'cnn.com', 'nytimes.com', 'buzzfeed.com',
                       'upworthy.com', 'clickhole.com', 'wikinews.com']


def remove_internet_stop_words(x):
    return ' '.join([word for word in str(x).lower().split() if word not in internet_stop_words])


df = pd.read_csv('../data/fulldata.csv')
df = df.drop_duplicates()

df.textdata = df.textdata.apply(lambda x: str(x).replace('report an issue thanks', '').strip())
df.textdata = df.textdata.apply(remove_internet_stop_words)
df.link_name = df.link_name.apply(remove_internet_stop_words)

df = df.drop(['status_type', 'status_link'], axis=1)

train_df, test_df = train_test_split(df, stratify=df.label.values, random_state=42, test_size=0.1)

train_df.to_csv('../data/train.csv', index=False, encoding='utf-8')
test_df.to_csv('../data/test.csv', index=False, encoding='utf-8')
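Because the split above is stratified on the label column, the class ratio should be nearly identical in train.csv and test.csv; a small check, assuming both files have been written (not part of the repo):

import pandas as pd

train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')
print(train_df.label.value_counts(normalize=True))
print(test_df.label.value_counts(normalize=True))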
================================================
FILE: data_processing/feature_generation.py
================================================
# coding: utf-8
"""
Generate numerical content based and text features from HTMLs
@author: Abhishek Thakur
"""

import pandas as pd
import cPickle
from bs4 import BeautifulSoup
from goose import Goose
from collections import Counter
import string
from joblib import Parallel, delayed
import sys
from tqdm import tqdm

stop_domains = ['buzzfeed', 'clickhole', 'cnn', 'wikinews', 'upworthy', 'nytimes']


def features(html):
    try:
        soup = BeautifulSoup(html, "lxml")
        g = Goose()
        try:
            goose_article = g.extract(raw_html=html)
        except TypeError:
            goose_article = None
        except IndexError:
            goose_article = None

        size = sys.getsizeof(html)
        html_len = len(html)
        number_of_links = len(soup.find_all('a'))
        number_of_buttons = len(soup.find_all('button'))
        number_of_inputs = len(soup.find_all('input'))
        number_of_ul = len(soup.find_all('ul'))
        number_of_ol = len(soup.find_all('ol'))
        number_of_lists = number_of_ol + number_of_ul
        number_of_h1 = len(soup.find_all('h1'))
        number_of_h2 = len(soup.find_all('h2'))

        if number_of_h1 > 0:
            h1_len = 0
            h1_text = ''
            for x in soup.find_all('h1'):
                text = x.get_text().strip()
                h1_text += text + ' '
                h1_len += len(text)
            total_h1_len = h1_len
            avg_h1_len = h1_len * 1. / number_of_h1
        else:
            total_h1_len = 0
            avg_h1_len = 0
            h1_text = ''

        if number_of_h2 > 0:
            h2_len = 0
            h2_text = ''
            for x in soup.find_all('h2'):
                text = x.get_text().strip()
                h2_len += len(text)
                h2_text += text + ' '
            total_h2_len = h2_len
            avg_h2_len = h2_len * 1. / number_of_h2
        else:
            total_h2_len = 0
            avg_h2_len = 0
            h2_text = ''

        if goose_article is not None:
            textdata = goose_article.meta_description + ' ' + h1_text + ' ' + h2_text
            textdata = "".join(l for l in textdata if l not in string.punctuation)
            textdata = textdata.strip().lower().split()
            textdata = [word for word in textdata if word.lower() not in stop_domains]
            textdata = ' '.join(textdata)
        else:
            textdata = h1_text + ' ' + h2_text
            textdata = "".join(l for l in textdata if l not in string.punctuation)
            textdata = textdata.strip().lower().split()
            textdata = [word for word in textdata if word.lower() not in stop_domains]
            textdata = ' '.join(textdata)

        number_of_images = len(soup.find_all('img'))
        number_of_tags = len([x.name for x in soup.find_all()])
        number_of_unique_tags = len(Counter([x.name for x in soup.find_all()]))

        return [size, html_len, number_of_links, number_of_buttons, number_of_inputs, number_of_ul,
                number_of_ol, number_of_lists, number_of_h1, number_of_h2, total_h1_len, total_h2_len,
                avg_h1_len, avg_h2_len, number_of_images, number_of_tags, number_of_unique_tags, textdata]

    except:
        return [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, "no data"]


clickbait_html = cPickle.load(open('../data/clickbait_html.pkl'))
clickbait_features = Parallel(n_jobs=50)(delayed(features)(html) for html in tqdm(clickbait_html))
clickbait_features_df = pd.DataFrame(clickbait_features,
                                     columns=["size", "html_len", "number_of_links", "number_of_buttons",
                                              "number_of_inputs", "number_of_ul", "number_of_ol",
                                              "number_of_lists", "number_of_h1", "number_of_h2",
                                              "total_h1_len", "total_h2_len", "avg_h1_len", "avg_h2_len",
                                              "number_of_images", "number_of_tags", "number_of_unique_tags",
                                              "textdata"])
clickbait_features_df.to_csv('../data/clickbait_website_features.csv', index=False, encoding='utf-8')

non_clickbait_html = cPickle.load(open('../data/non_clickbait_html.pkl'))
non_clickbait_features = Parallel(n_jobs=50)(delayed(features)(html) for html in tqdm(non_clickbait_html))
non_clickbait_features_df = pd.DataFrame(non_clickbait_features,
                                         columns=["size", "html_len", "number_of_links", "number_of_buttons",
                                                  "number_of_inputs", "number_of_ul", "number_of_ol",
                                                  "number_of_lists", "number_of_h1", "number_of_h2",
                                                  "total_h1_len", "total_h2_len", "avg_h1_len", "avg_h2_len",
                                                  "number_of_images", "number_of_tags", "number_of_unique_tags",
                                                  "textdata"])
non_clickbait_features_df.to_csv('../data/non_clickbait_website_features.csv', index=False, encoding='utf-8')
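features() falls back to a row of -1 values (and "no data") whenever parsing fails, and merge_data.py later drops those rows via html_len != -1. A short sketch, not part of the repo, to see how many pages failed extraction:

import pandas as pd

for name in ['clickbait_website_features.csv', 'non_clickbait_website_features.csv']:
    feats = pd.read_csv('../data/' + name)
    print('%s: %d of %d rows failed' % (name, (feats.html_len == -1).sum(), len(feats)))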
================================================
FILE: data_processing/html_scraper.py
================================================
# coding: utf-8
"""
Scrape and save html for all links in clickbait and non_clickbait CSVs
@author: Abhishek Thakur
"""

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

import pandas as pd
import requests
from joblib import Parallel, delayed
import cPickle
from tqdm import tqdm


def html_extractor(url):
    try:
        cookies = dict(cookies_are='working')
        r = requests.get(url, cookies=cookies)
        return r.text
    except:
        return "no html"


clickbaits = pd.read_csv('../data/clickbaits.csv')
non_clickbaits = pd.read_csv('../data/non_clickbaits.csv')

clickbait_urls = clickbaits.status_link.values
non_clickbait_urls = non_clickbaits.status_link.values

clickbait_html = Parallel(n_jobs=20)(delayed(html_extractor)(u) for u in tqdm(clickbait_urls))
cPickle.dump(clickbait_html, open('../data/clickbait_html.pkl', 'wb'), -1)

non_clickbait_html = Parallel(n_jobs=20)(delayed(html_extractor)(u) for u in tqdm(non_clickbait_urls))
cPickle.dump(non_clickbait_html, open('../data/non_clickbait_html.pkl', 'wb'), -1)
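html_extractor has no timeout, so a single unresponsive URL can stall one of the 20 parallel workers indefinitely. A hardened variant one could swap in (the retry count and timeout values are illustrative, not from the repo):

import requests


def html_extractor_with_retries(url, retries=3, timeout=10):
    # same contract as html_extractor above: return the page text or "no html"
    for _ in range(retries):
        try:
            r = requests.get(url, cookies=dict(cookies_are='working'), timeout=timeout)
            return r.text
        except requests.RequestException:
            continue
    return "no html"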
================================================
FILE: data_processing/merge_data.py
================================================
# coding: utf-8
"""
Merge original clickbait CSVs with features
@author: Abhishek Thakur
"""

import pandas as pd

clickbait_titles = pd.read_csv('../data/clickbaits.csv')
non_clickbait_titles = pd.read_csv('../data/non_clickbaits.csv')

clickbait_features = pd.read_csv('../data/clickbait_website_features.csv')
non_clickbait_features = pd.read_csv('../data/non_clickbait_website_features.csv')

clickbait_full = pd.concat([clickbait_titles, clickbait_features], axis=1)
non_clickbait_full = pd.concat([non_clickbait_titles, non_clickbait_features], axis=1)

clickbait_full['label'] = 1
non_clickbait_full['label'] = 0

fulldata = pd.concat([clickbait_full, non_clickbait_full])
fulldata = fulldata.sample(frac=1).reset_index(drop=True)
fulldata = fulldata[fulldata.html_len != -1]

fulldata.to_csv('../data/fulldata.csv', index=False)

================================================
FILE: deepnets/LSTM_Title_Content.py
================================================
# coding: utf-8
"""
LSTM on title and content text
@author: Abhishek Thakur
"""

import pandas as pd
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.engine.topology import Merge
from keras.callbacks import ModelCheckpoint
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

y_train = train.label.values
y_test = test.label.values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))
model1.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))
model2.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

merged_model = Sequential()
merged_model.add(Merge([model1, model2], mode='concat'))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(2))
merged_model.add(Activation('softmax'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)

merged_model.fit([x_train_title, x_train_textdata], y=ytrain_enc, batch_size=128, nb_epoch=200, verbose=2,
                 validation_split=0.1, shuffle=True, callbacks=[checkpoint])
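The script prepares x_test_title / x_test_textdata but never uses them; a minimal evaluation sketch, assuming it is run right after the fit call above, in the same Keras 1.x session, with the best checkpoint written to ../data/weights.h5:

import numpy as np

merged_model.load_weights('../data/weights.h5')
preds = merged_model.predict([x_test_title, x_test_textdata], batch_size=128)
print('test accuracy: %.4f' % np.mean(np.argmax(preds, axis=1) == y_test))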
================================================
FILE: deepnets/LSTM_Title_Content_Numerical_with_GloVe.py
================================================
# coding: utf-8
"""
LSTM on title + content text + numerical features with GloVe embeddings
@author: Abhishek Thakur
"""

import pandas as pd
import numpy as np
from tqdm import tqdm
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.engine.topology import Merge
from keras.callbacks import ModelCheckpoint
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text
from sklearn import preprocessing

# NOTE: train_v2.csv / test_v2.csv are assumed to be the train/test files written by
# data_cleaning.py (train.csv / test.csv) kept under a different name; they must
# contain the numerical HTML feature columns selected below.
train = pd.read_csv('../data/train_v2.csv')
test = pd.read_csv('../data/test_v2.csv')

y_train = train.label.values
y_test = test.label.values

train_num = train[["size", "html_len", "number_of_links", "number_of_buttons", "number_of_inputs",
                   "number_of_ul", "number_of_ol", "number_of_lists", "number_of_h1", "number_of_h2",
                   "total_h1_len", "total_h2_len", "avg_h1_len", "avg_h2_len", "number_of_images",
                   "number_of_tags", "number_of_unique_tags"]].values

test_num = test[["size", "html_len", "number_of_links", "number_of_buttons", "number_of_inputs",
                 "number_of_ul", "number_of_ol", "number_of_lists", "number_of_h1", "number_of_h2",
                 "total_h1_len", "total_h2_len", "avg_h1_len", "avg_h2_len", "number_of_images",
                 "number_of_tags", "number_of_unique_tags"]].values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

embeddings_index = {}
f = open('../data/glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

scl = preprocessing.StandardScaler()
train_num_scl = scl.fit_transform(train_num)
test_num_scl = scl.transform(test_num)

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model1.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model2.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model3 = Sequential()
model3.add(Dense(100, input_dim=train_num_scl.shape[1]))
model3.add(PReLU())
model3.add(Dropout(0.2))
model3.add(BatchNormalization())
model3.add(Dense(100))
model3.add(PReLU())
model3.add(Dropout(0.2))
model3.add(BatchNormalization())

merged_model = Sequential()
merged_model.add(Merge([model1, model2, model3], mode='concat'))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(2))
merged_model.add(Activation('softmax'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights_title+content_tdd.h5', monitor='val_acc', save_best_only=True,
                             verbose=2)

merged_model.fit([x_train_title, x_train_textdata, train_num_scl], y=ytrain_enc, batch_size=128, nb_epoch=200,
                 verbose=2, validation_split=0.1, shuffle=True, callbacks=[checkpoint])
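The GloVe loading loop uses line.split() and assumes the first field is the word; lines in glove.840B.300d.txt whose token contains embedded spaces can make np.asarray fail. A more defensive variant (assuming 300-dimensional vectors) that parses from the right and skips anything that still fails to convert:

import numpy as np
from tqdm import tqdm

embeddings_index = {}
with open('../data/glove.840B.300d.txt') as f:
    for line in tqdm(f):
        values = line.rstrip().split(' ')
        word = ' '.join(values[:-300])
        try:
            embeddings_index[word] = np.asarray(values[-300:], dtype='float32')
        except ValueError:
            continue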
================================================
FILE: deepnets/LSTM_Title_Content_with_GloVe.py
================================================
# coding: utf-8
"""
LSTM with title+content text with GloVe embeddings
@author: Abhishek Thakur
"""

import pandas as pd
import numpy as np
from tqdm import tqdm
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.engine.topology import Merge
from keras.callbacks import ModelCheckpoint
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

y_train = train.label.values
y_test = test.label.values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

embeddings_index = {}
f = open('../data/glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model1.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model2.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

merged_model = Sequential()
merged_model.add(Merge([model1, model2], mode='concat'))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(2))
merged_model.add(Activation('softmax'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)

merged_model.fit([x_train_title, x_train_textdata], y=ytrain_enc, batch_size=128, nb_epoch=200, verbose=2,
                 validation_split=0.1, shuffle=True, callbacks=[checkpoint])
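A quick way to gauge how much of the tokenizer vocabulary is actually covered by the pretrained vectors; a sketch that assumes word_index and embeddings_index from the script above are still in scope:

covered = sum(1 for w in word_index if w in embeddings_index)
print('GloVe coverage: %d of %d tokenizer words' % (covered, len(word_index)))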
================================================
FILE: deepnets/LSTM_Titles.py
================================================
# coding: utf-8
"""
Simple LSTM only on Titles
@author: Abhishek Thakur
"""

import pandas as pd
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

y_train = train.label.values
y_test = test.label.values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

model = Sequential()
model.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))
model.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model.add(Dense(200))
model.add(PReLU())
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(200))
model.add(PReLU())
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(2))
model.add(Activation('softmax'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)

model.fit(x_train_title, y=ytrain_enc, batch_size=128, nb_epoch=200, verbose=2, validation_split=0.1,
          shuffle=True, callbacks=[checkpoint])
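A small inference sketch for the titles-only model, assuming the objects above (tk, max_len, model) are still in scope and the best checkpoint was saved; label 1 is the clickbait class assigned in merge_data.py, and the example headline is made up:

model.load_weights('../data/weights.h5')
headline = ["You won't believe what this reporter found"]
seq = sequence.pad_sequences(tk.texts_to_sequences(headline), maxlen=max_len)
print('p(clickbait) = %.3f' % model.predict(seq)[0][1])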
================================================
FILE: deepnets/TDD_Title_Content_with_GloVe.py
================================================
# coding: utf-8
"""
Time distributed dense with GloVe embeddings (title + content text)
@author: Abhishek Thakur
"""

import pandas as pd
import numpy as np
from tqdm import tqdm
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.engine.topology import Merge
from keras.layers import TimeDistributed, Lambda
from keras.callbacks import ModelCheckpoint
from keras import backend as K
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

y_train = train.label.values
y_test = test.label.values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

embeddings_index = {}
f = open('../data/glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

merged_model = Sequential()
merged_model.add(Merge([model1, model2], mode='concat'))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(2))
merged_model.add(Activation('softmax'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)

merged_model.fit([x_train_title, x_train_textdata], y=ytrain_enc, batch_size=128, nb_epoch=200, verbose=2,
                 validation_split=0.1, shuffle=True, callbacks=[checkpoint])
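To reproduce the test-set number quoted in the README (0.992 for the title + content + numerical model), one could append something like the following to LSTM_Title_Content_Numerical_with_GloVe.py after training; a sketch only, assuming the Keras 1.x API used throughout the repo and the variables defined in that script:

import numpy as np
from keras.utils import np_utils

ytest_enc = np_utils.to_categorical(y_test)
merged_model.load_weights('../data/weights_title+content_tdd.h5')
scores = merged_model.evaluate([x_test_title, x_test_textdata, test_num_scl], ytest_enc, batch_size=128)
print(zip(merged_model.metrics_names, scores))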