[
  {
    "path": ".gitignore",
    "content": ".idea/**\n*.pyc\ndata/*.pkl\ndata/*.csv\ndata/*.zip\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2017 Abhishek Thakur\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# Clickbaits Revisited\n\nThis repository provides the code used for : https://www.linkedin.com/pulse/clickbaits-revisited-deep-learning-title-content-features-thakur\n\n\n### Data Collection\nTo run the code you must first collect the data:\n\n- Get facebook page parser from: https://github.com/minimaxir/facebook-page-post-scraper\n- Run the python script: get_fb_posts_fb_page.py for buzzfeed, upworthy, cnn, nytimes, wikinews, clickhole and StopClickBaitOfficial\n- Save all the CSVs obtained from above step in data/\n\n\n### Data Pre-Processing\nAfter the data has been collected, you need to run the following files to obtain training and test data. The order is important!\n\n    - $ cd data_processing\n    - $ python create_data.py\n    - $ python html_scraper.py\n    - $ python feature_generation.py\n    - $ python merge_data.py\n    - $ python data_cleaning.py\n \nAfter the steps above, you will end up with train.csv and test.csv in data/\n\nPlease note that the above steps will require a lot of memory. So, if you have anything less than 64GB, please modify the code according to your needs.\n\n### GloVe embeddings\n\nObtain GloVe embeddings from the following URL:\n\n    http://nlp.stanford.edu/data/glove.840B.300d.zip\n    \nExtract the zip and place the CSV in data/\n\n\n### Deepnets\n\nAfter all the above steps, you are ready to go and play around with the deep neural networks to classify clickbaits\n\nChange directory to deepnets/\n    \n    cd deepnets/\n     \nThe deepnets are as folllows:\n    \n    LSTM_Title.py : LSTM on title text without GloVe embeddings\n    LSTM_Title_Content.py : LSTM on title text and content text without GloVe embeddings\n    LSTM_Title_Content_with_GloVe.py : LSTM on title and content text with GloVe emebeddings\n    TDD_Title_Content_with_Glove.py : Time distributed dense on title and content text with GloVe embeddings\n    LSTM_Title_Content_Numerical_with_GloVe.py : LSTM on title + content text with GloVe embeddings & dense net for numerical features.\n     \n\n### Performance\n\nThe network with LSTM on title and content text with GloVe embeddings with numerical features achieves an accuracy of 0.996 during validation and 0.992 on the test set.\n\nAll models were trained on NVIDIA TitanX, Ubuntu 16.04 system with 64GB memory.\n\n"
  },
  {
    "path": "data/.keep",
    "content": ""
  },
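  {
    "path": "data_processing/check_source_csvs.py",
    "content": "# coding: utf-8\n\"\"\"\nHypothetical helper, not part of the original pipeline: checks that the CSVs\nproduced by the facebook-page-post-scraper step described in the README are\npresent in data/ before the pre-processing scripts are run. The expected\nfilenames are the ones read by create_data.py.\n\"\"\"\n\nimport os\n\nexpected_csvs = ['buzzfeed_facebook_statuses.csv',\n                 'clickhole_facebook_statuses.csv',\n                 'cnn_facebook_statuses.csv',\n                 'nytimes_facebook_statuses.csv',\n                 'StopClickBaitOfficial_facebook_statuses.csv',\n                 'Upworthy_facebook_statuses.csv',\n                 'wikinews_facebook_statuses.csv']\n\n# report anything missing so create_data.py does not fail half way through\nmissing = [name for name in expected_csvs if not os.path.isfile(os.path.join('../data', name))]\n\nif missing:\n    print('missing source CSVs: ' + ', '.join(missing))\nelse:\n    print('all source CSVs found, you can run create_data.py')\n"
  },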
  {
    "path": "data_processing/create_data.py",
    "content": "# coding: utf-8\n\"\"\"\nCreate usable data after scraping public facebook pages\n@author: Abhishek Thakur\n\"\"\"\n\nimport pandas as pd\n\nbuzzfeed = pd.read_csv('../data/buzzfeed_facebook_statuses.csv',\n                       usecols=['link_name', 'status_type', 'status_link'])\n\nclickhole = pd.read_csv('../data/clickhole_facebook_statuses.csv',\n                        usecols=['link_name', 'status_type', 'status_link'])\n\ncnn = pd.read_csv('../data/cnn_facebook_statuses.csv',\n                  usecols=['link_name', 'status_type', 'status_link'])\n\nnytimes = pd.read_csv('../data/nytimes_facebook_statuses.csv',\n                      usecols=['link_name', 'status_type', 'status_link'])\n\nstopclickbait = pd.read_csv('../data/StopClickBaitOfficial_facebook_statuses.csv',\n                            usecols=['link_name', 'status_type', 'status_link'])\n\nupworthy = pd.read_csv('../data/Upworthy_facebook_statuses.csv',\n                       usecols=['link_name', 'status_type', 'status_link'])\n\nwikinews = pd.read_csv('../data/wikinews_facebook_statuses.csv',\n                       usecols=['link_name', 'status_type', 'status_link'])\n\nwikinews.link_name = wikinews.link_name.apply(lambda x: str(x).replace(' - Wikinews, the free news source', ''))\nbuzzfeed = buzzfeed[buzzfeed.status_type == 'link']\nclickhole = clickhole[clickhole.status_type == 'link']\ncnn = cnn[cnn.status_type == 'link']\nnytimes = nytimes[nytimes.status_type == 'link']\nstopclickbait = stopclickbait[stopclickbait.status_type == 'link']\nupworthy = upworthy[upworthy.status_type == 'link']\nwikinews = wikinews[wikinews.status_type == 'link']\n\ncnn = cnn.sample(frac=1).head(10000)\nnytimes = nytimes.sample(frac=1).head(13000)\n\nclickbaits = pd.concat([buzzfeed, clickhole, stopclickbait, upworthy])\nnon_clickbaits = pd.concat([cnn, nytimes, wikinews])\n\nclickbaits.to_csv('../data/clickbaits.csv', index=False)\nnon_clickbaits.to_csv('../data/non_clickbaits.csv', index=False)\n"
  },
  {
    "path": "data_processing/data_cleaning.py",
    "content": "# coding: utf-8\n\"\"\"\nCleans the data more and separates into training and test sets\n@author: Abhishek Thakur\n\"\"\"\n\nimport pandas as pd\nfrom sklearn.cross_validation import train_test_split\n\ninternet_stop_words = ['site', 'navigation', 'new', 'times', 'york', 'information', 'index',\n                       'like', 'related', 'search', 'follow', 'subscribe', 'subscribed', 'subscribing',\n                       'spam', 'twitter', 'pinterest', 'facebook', 'google', 'privacy', 'policy', 'feedback',\n                       'tweet', 'tweets', 'disclaimer', 'buzzfeed', 'clickhole', 'upworthy', 'cnn', 'nytimes',\n                       'wikinews', 'instagram', 'newsletter', 'copyright', 'cnn.com', 'nytimes.com',\n                       'buzzfeed.com', 'upworthy.com', 'clickhole.com', 'wikinews.com']\n\n\ndef remove_internet_stop_words(x):\n    return ' '.join([word for word in str(x).lower().split() if word not in internet_stop_words])\n\n\ndf = pd.read_csv('../data/fulldata.csv')\ndf = df.drop_duplicates()\n\ndf.textdata = df.textdata.apply(lambda x: str(x).replace('report an issue thanks', '').strip())\n\ndf.textdata = df.textdata.apply(remove_internet_stop_words)\ndf.link_name = df.link_name.apply(remove_internet_stop_words)\n\ndf = df.drop(['status_type', 'status_link'], axis=1)\n\ntrain_df, test_df = train_test_split(df, stratify=df.label.values, random_state=42, test_size=0.1)\n\ntrain_df.to_csv('../data/train.csv', index=False, encoding='utf-8')\ntest_df.to_csv('../data/test.csv', index=False, encoding='utf-8')\n"
  },
  {
    "path": "data_processing/feature_generation.py",
    "content": "# coding: utf-8\n\"\"\"\nGenerate numerical content based and text features from HTMLs\n@author: Abhishek Thakur\n\"\"\"\n\nimport pandas as pd\nimport cPickle\nfrom bs4 import BeautifulSoup\nfrom goose import Goose\nfrom collections import Counter\nimport string\nfrom joblib import Parallel, delayed\nimport sys\nfrom tqdm import tqdm\n\nstop_domains = ['buzzfeed', 'clickhole', 'cnn', 'wikinews', 'upworthy', 'nytimes']\n\n\ndef features(html):\n    try:\n        soup = BeautifulSoup(html, \"lxml\")\n        g = Goose()\n        try:\n            goose_article = g.extract(raw_html=html)\n        except TypeError:\n            goose_article = None\n        except IndexError:\n            goose_article = None\n\n        size = sys.getsizeof(html)\n        html_len = len(html)\n        number_of_links = len(soup.find_all('a'))\n        number_of_buttons = len(soup.find_all('button'))\n        number_of_inputs = len(soup.find_all('input'))\n        number_of_ul = len(soup.find_all('ul'))\n        number_of_ol = len(soup.find_all('ol'))\n        number_of_lists = number_of_ol + number_of_ul\n        number_of_h1 = len(soup.find_all('h1'))\n        number_of_h2 = len(soup.find_all('h2'))\n        if number_of_h1 > 0:\n            h1_len = 0\n            h1_text = ''\n            for x in soup.find_all('h1'):\n                text = x.get_text().strip()\n                h1_text += text + ' '\n                h1_len += len(text)\n            total_h1_len = h1_len\n            avg_h1_len = h1_len * 1. / number_of_h1\n        else:\n            total_h1_len = 0\n            avg_h1_len = 0\n            h1_text = ''\n\n        if number_of_h2 > 0:\n            h2_len = 0\n            h2_text = ''\n            for x in soup.find_all('h2'):\n                text = x.get_text().strip()\n                h2_len += len(text)\n                h2_text += text + ' '\n            total_h2_len = h2_len\n            avg_h2_len = h2_len * 1. 
/ number_of_h2\n        else:\n            total_h2_len = 0\n            avg_h2_len = 0\n            h2_text = ''\n        if goose_article is not None:\n            textdata = goose_article.meta_description + ' ' + h1_text + ' ' + h2_text\n            textdata = \"\".join(l for l in textdata if l not in string.punctuation)\n            textdata = textdata.strip().lower().split()\n            textdata = [word for word in textdata if word.lower() not in stop_domains]\n            textdata = ' '.join(textdata)\n        else:\n            textdata = h1_text + ' ' + h2_text\n            textdata = \"\".join(l for l in textdata if l not in string.punctuation)\n            textdata = textdata.strip().lower().split()\n            textdata = [word for word in textdata if word.lower() not in stop_domains]\n            textdata = ' '.join(textdata)\n\n        number_of_images = len(soup.find_all('img'))\n\n        number_of_tags = len([x.name for x in soup.find_all()])\n        number_of_unique_tags = len(Counter([x.name for x in soup.find_all()]))\n\n        return [size, html_len, number_of_links, number_of_buttons,\n                number_of_inputs, number_of_ul, number_of_ol, number_of_lists,\n                number_of_h1, number_of_h2, total_h1_len, total_h2_len, avg_h1_len, avg_h2_len,\n                number_of_images, number_of_tags, number_of_unique_tags,\n                textdata]\n    except:\n        return [-1, -1, -1, -1,\n                -1, -1, -1, -1,\n                -1, -1, -1, -1, -1, -1,\n                -1, -1, -1,\n                \"no data\"]\n\n\nclickbait_html = cPickle.load(open('../data/clickbait_html.pkl'))\nclickbait_features = Parallel(n_jobs=50)(delayed(features)(html) for html in tqdm(clickbait_html))\n\nclickbait_features_df = pd.DataFrame(clickbait_features,\n                                     columns=[\"size\", \"html_len\", \"number_of_links\", \"number_of_buttons\",\n                                              \"number_of_inputs\", \"number_of_ul\", \"number_of_ol\", \"number_of_lists\",\n                                              \"number_of_h1\", \"number_of_h2\", \"total_h1_len\", \"total_h2_len\",\n                                              \"avg_h1_len\", \"avg_h2_len\",\n                                              \"number_of_images\", \"number_of_tags\", \"number_of_unique_tags\",\n                                              \"textdata\"])\n\nclickbait_features_df.to_csv('../data/clickbait_website_features.csv', index=False, encoding='utf-8')\n\nnon_clickbait_html = cPickle.load(open('../data/non_clickbait_html.pkl'))\nnon_clickbait_features = Parallel(n_jobs=50)(delayed(features)(html) for html in tqdm(non_clickbait_html))\n\nnon_clickbait_features_df = pd.DataFrame(non_clickbait_features,\n                                         columns=[\"size\", \"html_len\", \"number_of_links\", \"number_of_buttons\",\n                                                  \"number_of_inputs\", \"number_of_ul\", \"number_of_ol\", \"number_of_lists\",\n                                                  \"number_of_h1\", \"number_of_h2\", \"total_h1_len\", \"total_h2_len\",\n                                                  \"avg_h1_len\", \"avg_h2_len\",\n                                                  \"number_of_images\", \"number_of_tags\", \"number_of_unique_tags\",\n                                                  \"textdata\"])\n\nnon_clickbait_features_df.to_csv('../data/non_clickbait_website_features.csv', index=False, encoding='utf-8')\n"
  },
  {
    "path": "data_processing/html_scraper.py",
    "content": "# coding: utf-8\n\"\"\"\nScrape and save html for all links in clickbait and non_clickbait CSVs\n@author: Abhishek Thakur\n\"\"\"\nimport sys\nreload(sys)\nsys.setdefaultencoding('UTF8')\n\nimport pandas as pd\nimport requests\nfrom joblib import Parallel, delayed\nimport cPickle\nfrom tqdm import tqdm\n\n\ndef html_extractor(url):\n    try:\n        cookies = dict(cookies_are='working')\n        r = requests.get(url, cookies=cookies)\n        return r.text\n    except:\n        return \"no html\"\n\n\nclickbaits = pd.read_csv('../data/clickbaits.csv')\nnon_clickbaits = pd.read_csv('../data/non_clickbaits.csv')\n\nclickbait_urls = clickbaits.status_link.values\nnon_clickbait_urls = non_clickbaits.status_link.values\n\n\nclickbait_html = Parallel(n_jobs=20)(delayed(html_extractor)(u) for u in tqdm(clickbait_urls))\ncPickle.dump(clickbait_html, open('../data/clickbait_html.pkl', 'wb'), -1)\n\nnon_clickbait_html = Parallel(n_jobs=20)(delayed(html_extractor)(u) for u in tqdm(non_clickbait_urls))\ncPickle.dump(non_clickbait_html, open('../data/non_clickbait_html.pkl', 'wb'), -1)\n"
  },
  {
    "path": "data_processing/merge_data.py",
    "content": "# coding: utf-8\n\"\"\"\nMerge original clickbait CSVs with features\n@author: Abhishek Thakur\n\"\"\"\nimport pandas as pd\n\nclickbait_titles = pd.read_csv('../data/clickbaits.csv')\nnon_clickbait_titles = pd.read_csv('../data/non_clickbaits.csv')\n\nclickbait_features = pd.read_csv('../data/clickbait_website_features.csv')\nnon_clickbait_features = pd.read_csv('../data/non_clickbait_website_features.csv')\n\nclickbait_full = pd.concat([clickbait_titles, clickbait_features], axis=1)\nnon_clickbait_full = pd.concat([non_clickbait_titles, non_clickbait_features], axis=1)\n\nclickbait_full['label'] = 1\nnon_clickbait_full['label'] = 0\n\nfulldata = pd.concat([clickbait_full, non_clickbait_full])\nfulldata = fulldata.sample(frac=1).reset_index(drop=True)\nfulldata = fulldata[fulldata.html_len != -1]\n\nfulldata.to_csv('../data/fulldata.csv', index=False)\n"
  },
  {
    "path": "deepnets/LSTM_Title_Content.py",
    "content": "# coding: utf-8\n\"\"\"\nLSTM on title and content text\n@author: Abhishek Thakur\n\"\"\"\n\nimport pandas as pd\nfrom keras.models import Sequential\nfrom keras.layers.core import Dense, Activation, Dropout\nfrom keras.layers.embeddings import Embedding\nfrom keras.layers.recurrent import LSTM\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.utils import np_utils\nfrom keras.engine.topology import Merge\nfrom keras.callbacks import ModelCheckpoint\nfrom keras.layers.advanced_activations import PReLU\nfrom keras.preprocessing import sequence, text\n\ntrain = pd.read_csv('../data/train.csv')\ntest = pd.read_csv('../data/test.csv')\n\ny_train = train.label.values\ny_test = test.label.values\n\ntk = text.Tokenizer(nb_words=200000)\n\ntrain.link_name = train.link_name.astype(str)\ntest.link_name = test.link_name.astype(str)\n\ntrain.textdata = train.textdata.astype(str)\ntest.textdata = test.textdata.astype(str)\n\nmax_len = 80\n\ntk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) + list(test.link_name.values) + list(\n    test.textdata.values))\nx_train_title = tk.texts_to_sequences(train.link_name.values)\nx_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)\n\nx_train_textdata = tk.texts_to_sequences(train.textdata.values)\nx_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)\n\nx_test_title = tk.texts_to_sequences(test.link_name.values)\nx_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)\n\nx_test_textdata = tk.texts_to_sequences(test.textdata.values)\nx_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)\n\nword_index = tk.word_index\nytrain_enc = np_utils.to_categorical(y_train)\n\nmodel1 = Sequential()\nmodel1.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))\nmodel1.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))\n\nmodel2 = Sequential()\nmodel2.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))\nmodel2.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))\n\nmerged_model = Sequential()\nmerged_model.add(Merge([model1, model2], mode='concat'))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(200))\nmerged_model.add(PReLU())\nmerged_model.add(Dropout(0.2))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(200))\nmerged_model.add(PReLU())\nmerged_model.add(Dropout(0.2))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(2))\nmerged_model.add(Activation('softmax'))\n\nmerged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])\n\ncheckpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)\n\nmerged_model.fit([x_train_title, x_train_textdata], y=ytrain_enc,\n                 batch_size=128, nb_epoch=200, verbose=2, validation_split=0.1,\n                 shuffle=True, callbacks=[checkpoint])\n"
  },
  {
    "path": "deepnets/LSTM_Title_Content_Numerical_with_GloVe.py",
    "content": "# coding: utf-8\n\"\"\"\nLSTM on title + content text + numerical features with GloVe embeddings\n@author: Abhishek Thakur\n\"\"\"\n\nimport pandas as pd\nimport numpy as np\nfrom tqdm import tqdm\nfrom keras.models import Sequential\nfrom keras.layers.core import Dense, Activation, Dropout\nfrom keras.layers.embeddings import Embedding\nfrom keras.layers.recurrent import LSTM\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.utils import np_utils\nfrom keras.engine.topology import Merge\nfrom keras.callbacks import ModelCheckpoint\nfrom keras.layers.advanced_activations import PReLU\nfrom keras.preprocessing import sequence, text\nfrom sklearn import preprocessing\n\ntrain = pd.read_csv('../data/train_v2.csv')\ntest = pd.read_csv('../data/test_v2.csv')\n\ny_train = train.label.values\ny_test = test.label.values\n\ntrain_num = train[[\"size\", \"html_len\", \"number_of_links\", \"number_of_buttons\",\n                   \"number_of_inputs\", \"number_of_ul\", \"number_of_ol\", \"number_of_lists\",\n                   \"number_of_h1\", \"number_of_h2\", \"total_h1_len\", \"total_h2_len\", \"avg_h1_len\", \"avg_h2_len\",\n                   \"number_of_images\", \"number_of_tags\", \"number_of_unique_tags\"]].values\n\ntest_num = test[[\"size\", \"html_len\", \"number_of_links\", \"number_of_buttons\",\n                 \"number_of_inputs\", \"number_of_ul\", \"number_of_ol\", \"number_of_lists\",\n                 \"number_of_h1\", \"number_of_h2\", \"total_h1_len\", \"total_h2_len\", \"avg_h1_len\", \"avg_h2_len\",\n                 \"number_of_images\", \"number_of_tags\", \"number_of_unique_tags\"]].values\n\ntk = text.Tokenizer(nb_words=200000)\ntrain.link_name = train.link_name.astype(str)\ntest.link_name = test.link_name.astype(str)\ntrain.textdata = train.textdata.astype(str)\ntest.textdata = test.textdata.astype(str)\n\nmax_len = 80\n\ntk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) + list(test.link_name.values) + list(\n    test.textdata.values))\nx_train_title = tk.texts_to_sequences(train.link_name.values)\nx_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)\n\nx_train_textdata = tk.texts_to_sequences(train.textdata.values)\nx_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)\n\nx_test_title = tk.texts_to_sequences(test.link_name.values)\nx_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)\n\nx_test_textdata = tk.texts_to_sequences(test.textdata.values)\nx_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)\n\nword_index = tk.word_index\nytrain_enc = np_utils.to_categorical(y_train)\n\nembeddings_index = {}\nf = open('../data/glove.840B.300d.txt')\nfor line in tqdm(f):\n    values = line.split()\n    word = values[0]\n    coefs = np.asarray(values[1:], dtype='float32')\n    embeddings_index[word] = coefs\nf.close()\n\nembedding_matrix = np.zeros((len(word_index) + 1, 300))\nfor word, i in tqdm(word_index.items()):\n    embedding_vector = embeddings_index.get(word)\n    if embedding_vector is not None:\n        embedding_matrix[i] = embedding_vector\n\nscl = preprocessing.StandardScaler()\ntrain_num_scl = scl.fit_transform(train_num)\ntest_num_scl = scl.transform(test_num)\n\nmodel1 = Sequential()\nmodel1.add(Embedding(len(word_index) + 1,\n                     300,\n                     weights=[embedding_matrix],\n                     input_length=80,\n                     trainable=False))\nmodel1.add(LSTM(300, dropout_W=0.2, 
dropout_U=0.2))\n\nmodel2 = Sequential()\nmodel2.add(Embedding(len(word_index) + 1,\n                     300,\n                     weights=[embedding_matrix],\n                     input_length=80,\n                     trainable=False))\nmodel2.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))\n\nmodel3 = Sequential()\nmodel3.add(Dense(100, input_dim=train_num_scl.shape[1]))\nmodel3.add(PReLU())\nmodel3.add(Dropout(0.2))\nmodel3.add(BatchNormalization())\n\nmodel3.add(Dense(100))\nmodel3.add(PReLU())\nmodel3.add(Dropout(0.2))\nmodel3.add(BatchNormalization())\n\nmerged_model = Sequential()\nmerged_model.add(Merge([model1, model2, model3], mode='concat'))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(200))\nmerged_model.add(PReLU())\nmerged_model.add(Dropout(0.2))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(200))\nmerged_model.add(PReLU())\nmerged_model.add(Dropout(0.2))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(2))\nmerged_model.add(Activation('softmax'))\n\nmerged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])\n\ncheckpoint = ModelCheckpoint('../data/weights_title+content_tdd.h5', monitor='val_acc', save_best_only=True, verbose=2)\n\nmerged_model.fit([x_train_title, x_train_textdata, train_num_scl], y=ytrain_enc,\n                 batch_size=128, nb_epoch=200, verbose=2, validation_split=0.1,\n                 shuffle=True, callbacks=[checkpoint])\n"
  },
  {
    "path": "deepnets/LSTM_Title_Content_with_GloVe.py",
    "content": "# coding: utf-8\n\"\"\"\nLSTM with title+content text with GloVe embeddings\n@author: Abhishek Thakur\n\"\"\"\nimport pandas as pd\nimport numpy as np\nfrom tqdm import tqdm\nfrom keras.models import Sequential\nfrom keras.layers.core import Dense, Activation, Dropout\nfrom keras.layers.embeddings import Embedding\nfrom keras.layers.recurrent import LSTM\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.utils import np_utils\nfrom keras.engine.topology import Merge\nfrom keras.callbacks import ModelCheckpoint\nfrom keras.layers.advanced_activations import PReLU\nfrom keras.preprocessing import sequence, text\n\ntrain = pd.read_csv('../data/train.csv')\ntest = pd.read_csv('../data/test.csv')\n\ny_train = train.label.values\ny_test = test.label.values\n\ntk = text.Tokenizer(nb_words=200000)\ntrain.link_name = train.link_name.astype(str)\ntest.link_name = test.link_name.astype(str)\n\ntrain.textdata = train.textdata.astype(str)\ntest.textdata = test.textdata.astype(str)\n\nmax_len = 80\n\ntk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) + list(test.link_name.values) + list(\n    test.textdata.values))\nx_train_title = tk.texts_to_sequences(train.link_name.values)\nx_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)\n\nx_train_textdata = tk.texts_to_sequences(train.textdata.values)\nx_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)\n\nx_test_title = tk.texts_to_sequences(test.link_name.values)\nx_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)\n\nx_test_textdata = tk.texts_to_sequences(test.textdata.values)\nx_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)\n\nword_index = tk.word_index\nytrain_enc = np_utils.to_categorical(y_train)\n\nembeddings_index = {}\nf = open('../data/glove.840B.300d.txt')\nfor line in tqdm(f):\n    values = line.split()\n    word = values[0]\n    coefs = np.asarray(values[1:], dtype='float32')\n    embeddings_index[word] = coefs\nf.close()\n\nembedding_matrix = np.zeros((len(word_index) + 1, 300))\nfor word, i in tqdm(word_index.items()):\n    embedding_vector = embeddings_index.get(word)\n    if embedding_vector is not None:\n        embedding_matrix[i] = embedding_vector\n\nmodel1 = Sequential()\nmodel1.add(Embedding(len(word_index) + 1,\n                     300,\n                     weights=[embedding_matrix],\n                     input_length=80,\n                     trainable=False))\nmodel1.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))\n\nmodel2 = Sequential()\nmodel2.add(Embedding(len(word_index) + 1,\n                     300,\n                     weights=[embedding_matrix],\n                     input_length=80,\n                     trainable=False))\nmodel2.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))\n\nmerged_model = Sequential()\nmerged_model.add(Merge([model1, model2], mode='concat'))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(200))\nmerged_model.add(PReLU())\nmerged_model.add(Dropout(0.2))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(200))\nmerged_model.add(PReLU())\nmerged_model.add(Dropout(0.2))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(2))\nmerged_model.add(Activation('softmax'))\n\nmerged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])\ncheckpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)\n\nmerged_model.fit([x_train_title, 
x_train_textdata], y=ytrain_enc,\n                 batch_size=128, nb_epoch=200, verbose=2, validation_split=0.1,\n                 shuffle=True, callbacks=[checkpoint])\n"
  },
  {
    "path": "deepnets/LSTM_Titles.py",
    "content": "# coding: utf-8\n\"\"\"\nSimple LSTM only on Titles\n@author: Abhishek Thakur\n\"\"\"\n\nimport pandas as pd\nfrom keras.models import Sequential\nfrom keras.layers.core import Dense, Activation, Dropout\nfrom keras.layers.embeddings import Embedding\nfrom keras.layers.recurrent import LSTM\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.utils import np_utils\nfrom keras.callbacks import ModelCheckpoint\nfrom keras.layers.advanced_activations import PReLU\nfrom keras.preprocessing import sequence, text\n\ntrain = pd.read_csv('../data/train.csv')\ntest = pd.read_csv('../data/test.csv')\n\ny_train = train.label.values\ny_test = test.label.values\n\ntk = text.Tokenizer(nb_words=200000)\ntrain.link_name = train.link_name.astype(str)\ntest.link_name = test.link_name.astype(str)\ntrain.textdata = train.textdata.astype(str)\ntest.textdata = test.textdata.astype(str)\n\nmax_len = 80\n\ntk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) + list(test.link_name.values) + list(\n    test.textdata.values))\nx_train_title = tk.texts_to_sequences(train.link_name.values)\nx_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)\n\nx_train_textdata = tk.texts_to_sequences(train.textdata.values)\nx_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)\n\nx_test_title = tk.texts_to_sequences(test.link_name.values)\nx_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)\n\nx_test_textdata = tk.texts_to_sequences(test.textdata.values)\nx_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)\n\nword_index = tk.word_index\nytrain_enc = np_utils.to_categorical(y_train)\n\nmodel = Sequential()\nmodel.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))\nmodel.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))\n\nmodel.add(Dense(200))\nmodel.add(PReLU())\nmodel.add(Dropout(0.2))\nmodel.add(BatchNormalization())\n\nmodel.add(Dense(200))\nmodel.add(PReLU())\nmodel.add(Dropout(0.2))\nmodel.add(BatchNormalization())\n\nmodel.add(Dense(2))\nmodel.add(Activation('softmax'))\n\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])\n\ncheckpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)\n\nmodel.fit(x_train_title, y=ytrain_enc,\n                 batch_size=128, nb_epoch=200, verbose=2, validation_split=0.1,\n                 shuffle=True, callbacks=[checkpoint])\n"
  },
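  {
    "path": "deepnets/LSTM_Titles_evaluate_example.py",
    "content": "# coding: utf-8\n\"\"\"\nHypothetical evaluation sketch, not part of the original code: shows one way\nthe test-set accuracy reported in the README could be computed for the\ntitle-only model. Assumes LSTM_Titles.py has already been run and that\n../data/weights.h5 holds the checkpoint it saved. The tokenizer is fit on the\nsame texts as during training so that the word indices match.\n\"\"\"\n\nimport pandas as pd\nimport numpy as np\nfrom keras.models import Sequential\nfrom keras.layers.core import Dense, Activation, Dropout\nfrom keras.layers.embeddings import Embedding\nfrom keras.layers.recurrent import LSTM\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.layers.advanced_activations import PReLU\nfrom keras.preprocessing import sequence, text\n\ntrain = pd.read_csv('../data/train.csv')\ntest = pd.read_csv('../data/test.csv')\n\ny_test = test.label.values\n\ntk = text.Tokenizer(nb_words=200000)\ntrain.link_name = train.link_name.astype(str)\ntest.link_name = test.link_name.astype(str)\ntrain.textdata = train.textdata.astype(str)\ntest.textdata = test.textdata.astype(str)\n\nmax_len = 80\n\n# fit on the same corpus as LSTM_Titles.py so the word indices are identical\ntk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) + list(test.link_name.values) + list(test.textdata.values))\nx_test_title = tk.texts_to_sequences(test.link_name.values)\nx_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)\n\nword_index = tk.word_index\n\n# rebuild the exact architecture used in LSTM_Titles.py before loading weights\nmodel = Sequential()\nmodel.add(Embedding(len(word_index) + 1, 300, input_length=80, dropout=0.2))\nmodel.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))\n\nmodel.add(Dense(200))\nmodel.add(PReLU())\nmodel.add(Dropout(0.2))\nmodel.add(BatchNormalization())\n\nmodel.add(Dense(200))\nmodel.add(PReLU())\nmodel.add(Dropout(0.2))\nmodel.add(BatchNormalization())\n\nmodel.add(Dense(2))\nmodel.add(Activation('softmax'))\n\nmodel.load_weights('../data/weights.h5')\n\n# predicted class is the argmax over the two softmax outputs\npreds = model.predict(x_test_title, batch_size=128, verbose=1)\ntest_accuracy = np.mean(np.argmax(preds, axis=1) == y_test)\nprint('test accuracy: %.4f' % test_accuracy)\n"
  },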
  {
    "path": "deepnets/TDD_Title_Content_with_GloVe.py",
    "content": "\n# coding: utf-8\n\"\"\"\nTime distributed dense with GloVe embeddings (title + content text)\n@author: Abhishek Thakur\n\"\"\"\nimport pandas as pd\nimport numpy as np\nfrom tqdm import tqdm\nfrom keras.models import Sequential\nfrom keras.layers.core import Dense, Activation, Dropout\nfrom keras.layers.embeddings import Embedding\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.utils import np_utils\nfrom keras.engine.topology import Merge\nfrom keras.layers import TimeDistributed, Lambda\nfrom keras.callbacks import ModelCheckpoint\nfrom keras import backend as K\nfrom keras.layers.advanced_activations import PReLU\nfrom keras.preprocessing import sequence, text\n\ntrain = pd.read_csv('../data/train.csv')\ntest = pd.read_csv('../data/test.csv')\n\ny_train = train.label.values\ny_test = test.label.values\n\ntk = text.Tokenizer(nb_words=200000)\n\ntrain.link_name = train.link_name.astype(str)\ntest.link_name = test.link_name.astype(str)\n\ntrain.textdata = train.textdata.astype(str)\ntest.textdata = test.textdata.astype(str)\n\nmax_len = 80\n\ntk.fit_on_texts(list(train.link_name.values) + list(train.textdata.values) + list(test.link_name.values) + list(test.textdata.values))\nx_train_title = tk.texts_to_sequences(train.link_name.values)\nx_train_title = sequence.pad_sequences(x_train_title, maxlen=max_len)\n\nx_train_textdata = tk.texts_to_sequences(train.textdata.values)\nx_train_textdata = sequence.pad_sequences(x_train_textdata, maxlen=max_len)\n\n\nx_test_title = tk.texts_to_sequences(test.link_name.values)\nx_test_title = sequence.pad_sequences(x_test_title, maxlen=max_len)\n\nx_test_textdata = tk.texts_to_sequences(test.textdata.values)\nx_test_textdata = sequence.pad_sequences(x_test_textdata, maxlen=max_len)\n\nword_index = tk.word_index\nytrain_enc = np_utils.to_categorical(y_train)\n\nembeddings_index = {}\nf = open('../data/glove.840B.300d.txt')\nfor line in tqdm(f):\n    values = line.split()\n    word = values[0]\n    coefs = np.asarray(values[1:], dtype='float32')\n    embeddings_index[word] = coefs\nf.close()\n\nembedding_matrix = np.zeros((len(word_index) + 1, 300))\nfor word, i in tqdm(word_index.items()):\n    embedding_vector = embeddings_index.get(word)\n    if embedding_vector is not None:\n        embedding_matrix[i] = embedding_vector\n\nmodel1 = Sequential()\nmodel1.add(Embedding(len(word_index) + 1,\n                     300,\n                     weights=[embedding_matrix],\n                     input_length=80,\n                     trainable=False))\nmodel1.add(TimeDistributed(Dense(300, activation='relu')))\nmodel1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))\n\nmodel2 = Sequential()\nmodel2.add(Embedding(len(word_index) + 1,\n                     300,\n                     weights=[embedding_matrix],\n                     input_length=80,\n                     trainable=False))\nmodel2.add(TimeDistributed(Dense(300, activation='relu')))\nmodel2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))\n\n\nmerged_model = Sequential()\nmerged_model.add(Merge([model1, model2], 
mode='concat'))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(200))\nmerged_model.add(PReLU())\nmerged_model.add(Dropout(0.2))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(200))\nmerged_model.add(PReLU())\nmerged_model.add(Dropout(0.2))\nmerged_model.add(BatchNormalization())\n\nmerged_model.add(Dense(2))\nmerged_model.add(Activation('softmax'))\n\nmerged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'precision', 'recall'])\ncheckpoint = ModelCheckpoint('../data/weights.h5', monitor='val_acc', save_best_only=True, verbose=2)\n\nmerged_model.fit([x_train_title, x_train_textdata], y=ytrain_enc, \n                 batch_size=128, nb_epoch=200, verbose=2, validation_split=0.1, \n                 shuffle=True, callbacks=[checkpoint])\n"
  }
]