Repository: abhishekkrthakur/clickbaits_revisited
Branch: master
Commit: de9020635c58
Files: 14
Total size: 30.7 KB

Directory structure:
gitextract_ut2cqja5/
├── .gitignore
├── LICENSE
├── README.md
├── data/
│   └── .keep
├── data_processing/
│   ├── create_data.py
│   ├── data_cleaning.py
│   ├── feature_generation.py
│   ├── html_scraper.py
│   └── merge_data.py
└── deepnets/
    ├── LSTM_Title_Content.py
    ├── LSTM_Title_Content_Numerical_with_GloVe.py
    ├── LSTM_Title_Content_with_GloVe.py
    ├── LSTM_Titles.py
    └── TDD_Title_Content_with_GloVe.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
.idea/**
*.pyc
data/*.pkl
data/*.csv
data/*.zip

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2017 Abhishek Thakur

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
# Clickbaits Revisited

This repository provides the code used for: https://www.linkedin.com/pulse/clickbaits-revisited-deep-learning-title-content-features-thakur

### Data Collection

To run the code you must first collect the data:

- Get the facebook page parser from: https://github.com/minimaxir/facebook-page-post-scraper
- Run the python script get_fb_posts_fb_page.py for buzzfeed, upworthy, cnn, nytimes, wikinews, clickhole and StopClickBaitOfficial
- Save all the CSVs obtained from the above step in data/

### Data Pre-Processing

After the data has been collected, you need to run the following files to obtain training and test data. The order is important!

- $ cd data_processing
- $ python create_data.py
- $ python html_scraper.py
- $ python feature_generation.py
- $ python merge_data.py
- $ python data_cleaning.py

After the steps above, you will end up with train.csv and test.csv in data/

Please note that the above steps will require a lot of memory. So, if you have anything less than 64GB, please modify the code according to your needs.
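One way to trade speed for memory (a sketch only, not part of the original pipeline; the `fetch` helper, the chunk size and the part-file naming are illustrative) is to scrape and pickle the pages in smaller chunks instead of holding every HTML document in one list:

    import cPickle
    import pandas as pd
    import requests
    from joblib import Parallel, delayed
    from tqdm import tqdm


    def fetch(url):
        # illustrative fetcher with a timeout; the repo's html_scraper.py has no timeout
        try:
            return requests.get(url, timeout=10).text
        except requests.RequestException:
            return "no html"


    urls = pd.read_csv('../data/clickbaits.csv').status_link.values
    CHUNK = 5000
    for i in range(0, len(urls), CHUNK):
        htmls = Parallel(n_jobs=20)(delayed(fetch)(u) for u in tqdm(urls[i:i + CHUNK]))
        cPickle.dump(htmls, open('../data/clickbait_html_part_%d.pkl' % (i // CHUNK), 'wb'), -1)

The downstream scripts would then have to iterate over the part files instead of loading a single big pickle.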
### GloVe embeddings

Obtain GloVe embeddings from the following URL: http://nlp.stanford.edu/data/glove.840B.300d.zip

Extract the zip and place glove.840B.300d.txt in data/

### Deepnets

After all the above steps, you are ready to play around with the deep neural networks to classify clickbaits.

Change directory to deepnets/

- $ cd deepnets/

The deepnets are as follows:

- LSTM_Titles.py : LSTM on title text without GloVe embeddings
- LSTM_Title_Content.py : LSTM on title text and content text without GloVe embeddings
- LSTM_Title_Content_with_GloVe.py : LSTM on title and content text with GloVe embeddings
- TDD_Title_Content_with_GloVe.py : Time distributed dense on title and content text with GloVe embeddings
- LSTM_Title_Content_Numerical_with_GloVe.py : LSTM on title + content text with GloVe embeddings & dense net for numerical features

### Performance

The network with LSTM on title and content text with GloVe embeddings plus numerical features achieves an accuracy of 0.996 during validation and 0.992 on the test set.

All models were trained on an NVIDIA TitanX, Ubuntu 16.04 system with 64GB memory.

================================================
FILE: data/.keep
================================================

================================================
FILE: data_processing/create_data.py
================================================
# coding: utf-8
"""
Create usable data after scraping public facebook pages
@author: Abhishek Thakur
"""

import pandas as pd

buzzfeed = pd.read_csv('../data/buzzfeed_facebook_statuses.csv',
                       usecols=['link_name', 'status_type', 'status_link'])
clickhole = pd.read_csv('../data/clickhole_facebook_statuses.csv',
                        usecols=['link_name', 'status_type', 'status_link'])
cnn = pd.read_csv('../data/cnn_facebook_statuses.csv',
                  usecols=['link_name', 'status_type', 'status_link'])
nytimes = pd.read_csv('../data/nytimes_facebook_statuses.csv',
                      usecols=['link_name', 'status_type', 'status_link'])
stopclickbait = pd.read_csv('../data/StopClickBaitOfficial_facebook_statuses.csv',
                            usecols=['link_name', 'status_type', 'status_link'])
upworthy = pd.read_csv('../data/Upworthy_facebook_statuses.csv',
                       usecols=['link_name', 'status_type', 'status_link'])
wikinews = pd.read_csv('../data/wikinews_facebook_statuses.csv',
                       usecols=['link_name', 'status_type', 'status_link'])

wikinews.link_name = wikinews.link_name.apply(lambda x: str(x).replace(' - Wikinews, the free news source', ''))

# keep only link-type statuses
buzzfeed = buzzfeed[buzzfeed.status_type == 'link']
clickhole = clickhole[clickhole.status_type == 'link']
cnn = cnn[cnn.status_type == 'link']
nytimes = nytimes[nytimes.status_type == 'link']
stopclickbait = stopclickbait[stopclickbait.status_type == 'link']
upworthy = upworthy[upworthy.status_type == 'link']
wikinews = wikinews[wikinews.status_type == 'link']

# shuffle and subsample the two largest news pages
cnn = cnn.sample(frac=1).head(10000)
nytimes = nytimes.sample(frac=1).head(13000)

clickbaits = pd.concat([buzzfeed, clickhole, stopclickbait, upworthy])
non_clickbaits = pd.concat([cnn, nytimes, wikinews])

clickbaits.to_csv('../data/clickbaits.csv', index=False)
non_clickbaits.to_csv('../data/non_clickbaits.csv', index=False)
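Since create_data.py shuffles and subsamples the cnn and nytimes statuses, a quick sanity check of the resulting class sizes can be useful before scraping; a minimal sketch, not part of the repo:

import pandas as pd

clickbaits = pd.read_csv('../data/clickbaits.csv')
non_clickbaits = pd.read_csv('../data/non_clickbaits.csv')
print('clickbait links: %d, non-clickbait links: %d' % (len(clickbaits), len(non_clickbaits)))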
================================================
FILE: data_processing/data_cleaning.py
================================================
# coding: utf-8
"""
Cleans the data more and separates into training and test sets
@author: Abhishek Thakur
"""

import pandas as pd
from sklearn.cross_validation import train_test_split

internet_stop_words = ['site', 'navigation', 'new', 'times', 'york', 'information', 'index', 'like', 'related',
                       'search', 'follow', 'subscribe', 'subscribed', 'subscribing', 'spam', 'twitter',
                       'pinterest', 'facebook', 'google', 'privacy', 'policy', 'feedback', 'tweet', 'tweets',
                       'disclaimer', 'buzzfeed', 'clickhole', 'upworthy', 'cnn', 'nytimes', 'wikinews',
                       'instagram', 'newsletter', 'copyright', 'cnn.com', 'nytimes.com', 'buzzfeed.com',
                       'upworthy.com', 'clickhole.com', 'wikinews.com']


def remove_internet_stop_words(x):
    return ' '.join([word for word in str(x).lower().split() if word not in internet_stop_words])


df = pd.read_csv('../data/fulldata.csv')
df = df.drop_duplicates()

df.textdata = df.textdata.apply(lambda x: str(x).replace('report an issue thanks', '').strip())
df.textdata = df.textdata.apply(remove_internet_stop_words)
df.link_name = df.link_name.apply(remove_internet_stop_words)

df = df.drop(['status_type', 'status_link'], axis=1)

train_df, test_df = train_test_split(df, stratify=df.label.values, random_state=42, test_size=0.1)

train_df.to_csv('../data/train.csv', index=False, encoding='utf-8')
test_df.to_csv('../data/test.csv', index=False, encoding='utf-8')
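Because the split above is stratified on the label column, the class ratio should be nearly identical in train.csv and test.csv; a small check, assuming both files have been written (not part of the repo):

import pandas as pd

train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')
print(train_df.label.value_counts(normalize=True))
print(test_df.label.value_counts(normalize=True))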
================================================
FILE: data_processing/feature_generation.py
================================================
# coding: utf-8
"""
Generate numerical content based and text features from HTMLs
@author: Abhishek Thakur
"""

import pandas as pd
import cPickle
from bs4 import BeautifulSoup
from goose import Goose
from collections import Counter
import string
from joblib import Parallel, delayed
import sys
from tqdm import tqdm

stop_domains = ['buzzfeed', 'clickhole', 'cnn', 'wikinews', 'upworthy', 'nytimes']


def features(html):
    try:
        soup = BeautifulSoup(html, "lxml")
        g = Goose()
        try:
            goose_article = g.extract(raw_html=html)
        except TypeError:
            goose_article = None
        except IndexError:
            goose_article = None

        size = sys.getsizeof(html)
        html_len = len(html)
        number_of_links = len(soup.find_all('a'))
        number_of_buttons = len(soup.find_all('button'))
        number_of_inputs = len(soup.find_all('input'))
        number_of_ul = len(soup.find_all('ul'))
        number_of_ol = len(soup.find_all('ol'))
        number_of_lists = number_of_ol + number_of_ul
        number_of_h1 = len(soup.find_all('h1'))
        number_of_h2 = len(soup.find_all('h2'))

        if number_of_h1 > 0:
            h1_len = 0
            h1_text = ''
            for x in soup.find_all('h1'):
                text = x.get_text().strip()
                h1_text += text + ' '
                h1_len += len(text)
            total_h1_len = h1_len
            avg_h1_len = h1_len * 1. / number_of_h1
        else:
            total_h1_len = 0
            avg_h1_len = 0
            h1_text = ''

        if number_of_h2 > 0:
            h2_len = 0
            h2_text = ''
            for x in soup.find_all('h2'):
                text = x.get_text().strip()
                h2_len += len(text)
                h2_text += text + ' '
            total_h2_len = h2_len
            avg_h2_len = h2_len * 1. / number_of_h2
        else:
            total_h2_len = 0
            avg_h2_len = 0
            h2_text = ''

        if goose_article is not None:
            textdata = goose_article.meta_description + ' ' + h1_text + ' ' + h2_text
            textdata = "".join(l for l in textdata if l not in string.punctuation)
            textdata = textdata.strip().lower().split()
            textdata = [word for word in textdata if word.lower() not in stop_domains]
            textdata = ' '.join(textdata)
        else:
            textdata = h1_text + ' ' + h2_text
            textdata = "".join(l for l in textdata if l not in string.punctuation)
            textdata = textdata.strip().lower().split()
            textdata = [word for word in textdata if word.lower() not in stop_domains]
            textdata = ' '.join(textdata)

        number_of_images = len(soup.find_all('img'))
        number_of_tags = len([x.name for x in soup.find_all()])
        number_of_unique_tags = len(Counter([x.name for x in soup.find_all()]))

        return [size, html_len, number_of_links, number_of_buttons, number_of_inputs, number_of_ul,
                number_of_ol, number_of_lists, number_of_h1, number_of_h2, total_h1_len, total_h2_len,
                avg_h1_len, avg_h2_len, number_of_images, number_of_tags, number_of_unique_tags, textdata]

    except:
        return [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, "no data"]


clickbait_html = cPickle.load(open('../data/clickbait_html.pkl'))
clickbait_features = Parallel(n_jobs=50)(delayed(features)(html) for html in tqdm(clickbait_html))
clickbait_features_df = pd.DataFrame(clickbait_features,
                                     columns=["size", "html_len", "number_of_links", "number_of_buttons",
                                              "number_of_inputs", "number_of_ul", "number_of_ol",
                                              "number_of_lists", "number_of_h1", "number_of_h2",
                                              "total_h1_len", "total_h2_len", "avg_h1_len", "avg_h2_len",
                                              "number_of_images", "number_of_tags", "number_of_unique_tags",
                                              "textdata"])
clickbait_features_df.to_csv('../data/clickbait_website_features.csv', index=False, encoding='utf-8')

non_clickbait_html = cPickle.load(open('../data/non_clickbait_html.pkl'))
non_clickbait_features = Parallel(n_jobs=50)(delayed(features)(html) for html in tqdm(non_clickbait_html))
non_clickbait_features_df = pd.DataFrame(non_clickbait_features,
                                         columns=["size", "html_len", "number_of_links", "number_of_buttons",
                                                  "number_of_inputs", "number_of_ul", "number_of_ol",
                                                  "number_of_lists", "number_of_h1", "number_of_h2",
                                                  "total_h1_len", "total_h2_len", "avg_h1_len", "avg_h2_len",
                                                  "number_of_images", "number_of_tags", "number_of_unique_tags",
                                                  "textdata"])
non_clickbait_features_df.to_csv('../data/non_clickbait_website_features.csv', index=False, encoding='utf-8')
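features() falls back to a row of -1 values (and "no data") whenever parsing fails, and merge_data.py later drops those rows via html_len != -1. A short sketch, not part of the repo, to see how many pages failed extraction:

import pandas as pd

for name in ['clickbait_website_features.csv', 'non_clickbait_website_features.csv']:
    feats = pd.read_csv('../data/' + name)
    print('%s: %d of %d rows failed' % (name, (feats.html_len == -1).sum(), len(feats)))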
================================================
FILE: data_processing/html_scraper.py
================================================
# coding: utf-8
"""
Scrape and save html for all links in clickbait and non_clickbait CSVs
@author: Abhishek Thakur
"""

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

import pandas as pd
import requests
from joblib import Parallel, delayed
import cPickle
from tqdm import tqdm


def html_extractor(url):
    try:
        cookies = dict(cookies_are='working')
        r = requests.get(url, cookies=cookies)
        return r.text
    except:
        return "no html"


clickbaits = pd.read_csv('../data/clickbaits.csv')
non_clickbaits = pd.read_csv('../data/non_clickbaits.csv')

clickbait_urls = clickbaits.status_link.values
non_clickbait_urls = non_clickbaits.status_link.values

clickbait_html = Parallel(n_jobs=20)(delayed(html_extractor)(u) for u in tqdm(clickbait_urls))
cPickle.dump(clickbait_html, open('../data/clickbait_html.pkl', 'wb'), -1)

non_clickbait_html = Parallel(n_jobs=20)(delayed(html_extractor)(u) for u in tqdm(non_clickbait_urls))
cPickle.dump(non_clickbait_html, open('../data/non_clickbait_html.pkl', 'wb'), -1)
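html_extractor has no timeout, so a single unresponsive URL can stall one of the 20 parallel workers indefinitely. A hardened variant one could swap in (the retry count and timeout values are illustrative, not from the repo):

import requests


def html_extractor_with_retries(url, retries=3, timeout=10):
    # same contract as html_extractor above: return the page text or "no html"
    for _ in range(retries):
        try:
            r = requests.get(url, cookies=dict(cookies_are='working'), timeout=timeout)
            return r.text
        except requests.RequestException:
            continue
    return "no html"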
================================================
FILE: data_processing/merge_data.py
================================================
# coding: utf-8
"""
Merge original clickbait CSVs with features
@author: Abhishek Thakur
"""

import pandas as pd

clickbait_titles = pd.read_csv('../data/clickbaits.csv')
non_clickbait_titles = pd.read_csv('../data/non_clickbaits.csv')

clickbait_features = pd.read_csv('../data/clickbait_website_features.csv')
non_clickbait_features = pd.read_csv('../data/non_clickbait_website_features.csv')

clickbait_full = pd.concat([clickbait_titles, clickbait_features], axis=1)
non_clickbait_full = pd.concat([non_clickbait_titles, non_clickbait_features], axis=1)

clickbait_full['label'] = 1
non_clickbait_full['label'] = 0

fulldata = pd.concat([clickbait_full, non_clickbait_full])
fulldata = fulldata.sample(frac=1).reset_index(drop=True)
fulldata = fulldata[fulldata.html_len != -1]

fulldata.to_csv('../data/fulldata.csv', index=False)

================================================
FILE: deepnets/LSTM_Title_Content.py
================================================
# coding: utf-8
"""
LSTM on title and content text
@author: Abhishek Thakur
"""

import pandas as pd
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.engine.topology import Merge
from keras.callbacks import ModelCheckpoint
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

y_train = train.label.values
y_test = test.label.values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))
model1.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))
model2.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

merged_model = Sequential()
merged_model.add(Merge([model1, model2], mode='concat'))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(2))
merged_model.add(Activation('softmax'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)

merged_model.fit([x_train_title, x_train_textdata], y=ytrain_enc, batch_size=128, nb_epoch=200, verbose=2,
                 validation_split=0.1, shuffle=True, callbacks=[checkpoint])
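The script prepares x_test_title / x_test_textdata but never uses them; a minimal evaluation sketch, assuming it is run right after the fit call above, in the same Keras 1.x session, with the best checkpoint written to ../data/weights.h5:

import numpy as np

merged_model.load_weights('../data/weights.h5')
preds = merged_model.predict([x_test_title, x_test_textdata], batch_size=128)
print('test accuracy: %.4f' % np.mean(np.argmax(preds, axis=1) == y_test))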
================================================
FILE: deepnets/LSTM_Title_Content_Numerical_with_GloVe.py
================================================
# coding: utf-8
"""
LSTM on title + content text + numerical features with GloVe embeddings
@author: Abhishek Thakur
"""

import pandas as pd
import numpy as np
from tqdm import tqdm
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.engine.topology import Merge
from keras.callbacks import ModelCheckpoint
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text
from sklearn import preprocessing

# NOTE: train_v2.csv / test_v2.csv are assumed to be the train/test files written by
# data_cleaning.py (train.csv / test.csv) kept under a different name; they must
# contain the numerical HTML feature columns selected below.
train = pd.read_csv('../data/train_v2.csv')
test = pd.read_csv('../data/test_v2.csv')

y_train = train.label.values
y_test = test.label.values

train_num = train[["size", "html_len", "number_of_links", "number_of_buttons", "number_of_inputs",
                   "number_of_ul", "number_of_ol", "number_of_lists", "number_of_h1", "number_of_h2",
                   "total_h1_len", "total_h2_len", "avg_h1_len", "avg_h2_len", "number_of_images",
                   "number_of_tags", "number_of_unique_tags"]].values

test_num = test[["size", "html_len", "number_of_links", "number_of_buttons", "number_of_inputs",
                 "number_of_ul", "number_of_ol", "number_of_lists", "number_of_h1", "number_of_h2",
                 "total_h1_len", "total_h2_len", "avg_h1_len", "avg_h2_len", "number_of_images",
                 "number_of_tags", "number_of_unique_tags"]].values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

embeddings_index = {}
f = open('../data/glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

scl = preprocessing.StandardScaler()
train_num_scl = scl.fit_transform(train_num)
test_num_scl = scl.transform(test_num)

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model1.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model2.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model3 = Sequential()
model3.add(Dense(100, input_dim=train_num_scl.shape[1]))
model3.add(PReLU())
model3.add(Dropout(0.2))
model3.add(BatchNormalization())
model3.add(Dense(100))
model3.add(PReLU())
model3.add(Dropout(0.2))
model3.add(BatchNormalization())

merged_model = Sequential()
merged_model.add(Merge([model1, model2, model3], mode='concat'))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(2))
merged_model.add(Activation('softmax'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights_title+content_tdd.h5', monitor='val_acc', save_best_only=True,
                             verbose=2)

merged_model.fit([x_train_title, x_train_textdata, train_num_scl], y=ytrain_enc, batch_size=128, nb_epoch=200,
                 verbose=2, validation_split=0.1, shuffle=True, callbacks=[checkpoint])
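The GloVe loading loop uses line.split() and assumes the first field is the word; lines in glove.840B.300d.txt whose token contains embedded spaces can make np.asarray fail. A more defensive variant (assuming 300-dimensional vectors) that parses from the right and skips anything that still fails to convert:

import numpy as np
from tqdm import tqdm

embeddings_index = {}
with open('../data/glove.840B.300d.txt') as f:
    for line in tqdm(f):
        values = line.rstrip().split(' ')
        word = ' '.join(values[:-300])
        try:
            embeddings_index[word] = np.asarray(values[-300:], dtype='float32')
        except ValueError:
            continue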
================================================
FILE: deepnets/LSTM_Title_Content_with_GloVe.py
================================================
# coding: utf-8
"""
LSTM with title+content text with GloVe embeddings
@author: Abhishek Thakur
"""

import pandas as pd
import numpy as np
from tqdm import tqdm
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.engine.topology import Merge
from keras.callbacks import ModelCheckpoint
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

y_train = train.label.values
y_test = test.label.values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

embeddings_index = {}
f = open('../data/glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model1.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model2.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

merged_model = Sequential()
merged_model.add(Merge([model1, model2], mode='concat'))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(2))
merged_model.add(Activation('softmax'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)

merged_model.fit([x_train_title, x_train_textdata], y=ytrain_enc, batch_size=128, nb_epoch=200, verbose=2,
                 validation_split=0.1, shuffle=True, callbacks=[checkpoint])
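A quick way to gauge how much of the tokenizer vocabulary is actually covered by the pretrained vectors; a sketch that assumes word_index and embeddings_index from the script above are still in scope:

covered = sum(1 for w in word_index if w in embeddings_index)
print('GloVe coverage: %d of %d tokenizer words' % (covered, len(word_index)))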
================================================
FILE: deepnets/LSTM_Titles.py
================================================
# coding: utf-8
"""
Simple LSTM only on Titles
@author: Abhishek Thakur
"""

import pandas as pd
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

y_train = train.label.values
y_test = test.label.values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

model = Sequential()
model.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))
model.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model.add(Dense(200))
model.add(PReLU())
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(200))
model.add(PReLU())
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(2))
model.add(Activation('softmax'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)

model.fit(x_train_title, y=ytrain_enc, batch_size=128, nb_epoch=200, verbose=2, validation_split=0.1,
          shuffle=True, callbacks=[checkpoint])
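A small inference sketch for the titles-only model, assuming the objects above (tk, max_len, model) are still in scope and the best checkpoint was saved; label 1 is the clickbait class assigned in merge_data.py, and the example headline is made up:

model.load_weights('../data/weights.h5')
headline = ["You won't believe what this reporter found"]
seq = sequence.pad_sequences(tk.texts_to_sequences(headline), maxlen=max_len)
print('p(clickbait) = %.3f' % model.predict(seq)[0][1])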
================================================
FILE: deepnets/TDD_Title_Content_with_GloVe.py
================================================
# coding: utf-8
"""
Time distributed dense with GloVe embeddings (title + content text)
@author: Abhishek Thakur
"""

import pandas as pd
import numpy as np
from tqdm import tqdm
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.engine.topology import Merge
from keras.layers import TimeDistributed, Lambda
from keras.callbacks import ModelCheckpoint
from keras import backend as K
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

y_train = train.label.values
y_test = test.label.values

tk = text.Tokenizer(nb_words=200000)

train.link_name = train.link_name.astype(str)
test.link_name = test.link_name.astype(str)
train.textdata = train.textdata.astype(str)
test.textdata = test.textdata.astype(str)

max_len = 80

tk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) +
                list(test.link_name.values) + list(test.textdata.values))

x_train_title = tk.texts_to_sequences(train.link_name.values)
x_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)
x_train_textdata = tk.texts_to_sequences(train.textdata.values)
x_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)

x_test_title = tk.texts_to_sequences(test.link_name.values)
x_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)
x_test_textdata = tk.texts_to_sequences(test.textdata.values)
x_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)

word_index = tk.word_index

ytrain_enc = np_utils.to_categorical(y_train)

embeddings_index = {}
f = open('../data/glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input_length=80, trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

merged_model = Sequential()
merged_model.add(Merge([model1, model2], mode='concat'))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(200))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))

merged_model.add(BatchNormalization())
merged_model.add(Dense(2))
merged_model.add(Activation('softmax'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])

checkpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)

merged_model.fit([x_train_title, x_train_textdata], y=ytrain_enc, batch_size=128, nb_epoch=200, verbose=2,
                 validation_split=0.1, shuffle=True, callbacks=[checkpoint])
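To reproduce the test-set number quoted in the README (0.992 for the title + content + numerical model), one could append something like the following to LSTM_Title_Content_Numerical_with_GloVe.py after training; a sketch only, assuming the Keras 1.x API used throughout the repo and the variables defined in that script:

import numpy as np
from keras.utils import np_utils

ytest_enc = np_utils.to_categorical(y_test)
merged_model.load_weights('../data/weights_title+content_tdd.h5')
scores = merged_model.evaluate([x_test_title, x_test_textdata, test_num_scl], ytest_enc, batch_size=128)
print(zip(merged_model.metrics_names, scores))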