Repository: oxford-cs-deepnlp-2017/practical-1
Branch: master
Commit: eb157b354077
Files: 6
Total size: 71.1 MB

Directory structure:
gitextract_zhmwk546/
├── .gitattributes
├── .gitignore
├── README.md
├── install-python.sh
├── practical1.ipynb
└── ted_en-20160408.xml

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitattributes
================================================

================================================
FILE: .gitignore
================================================
*.zip

================================================
FILE: README.md
================================================
# Practical 1: word2vec

[Brendan Shillingford, Yannis Assael, Chris Dyer]

For this practical, you'll be provided with a partially-complete IPython notebook, an interactive web-based Python computing environment that allows us to mix text, code, and interactive plots. We will be training word2vec models on TED Talk and Wikipedia data, using the word2vec implementation included in the Python package `gensim`. After training the models, we will analyze and visualize the learned embeddings.

## Setup and installation

On a lab workstation, clone the practical repository and run the `. install-python.sh` shell script in a terminal to install Anaconda with Python 3 and the packages required for this practical. Run `ipython notebook` in the repository directory and open the `practical1.ipynb` notebook in your browser.

## Preliminaries

### Preprocessing

The code for downloading the dataset and preprocessing it is prewritten to save time. However, you will be expected to perform this kind of task yourself in future practicals, given raw data, so read it and make sure you understand it. Often, one uses a library like `nltk` to simplify this task, but we have not done so here; instead, we opted to use regular expressions via Python's `re` module.

### Word frequencies

Make a list of the most common words and their occurrence counts. Take a look at the top 40 words. You may want to use the `sklearn.feature_extraction.text` module's `CountVectorizer` class or the `collections` module's `Counter` class. Take the top 1000 words, and plot a histogram of their counts. The plotting code for an interactive histogram is already given in the notebook.

**Handin:** show the frequency distribution histogram.

## Training Word2Vec

Now that we have a processed list of sentences, let's run the word2vec training. Begin by reading the gensim documentation for word2vec at <https://radimrehurek.com/gensim/models/word2vec.html> to figure out how to use the `Word2Vec` class. Learn embeddings in $\mathbb R^{100}$ using CBOW (which is the default). Other options should be left at their defaults, except `min_count=10` so that infrequent words are ignored. The training process should take under half a minute.

If your trained `Word2Vec` instance is called `model_ted`, you should be able to check the vocabulary size using `len(model_ted.vocab)`, which should be around 14427.

Try using the `most_similar()` method to return a list of the most similar words to "man" and "computer".

**Handin:** find a few more words with interesting and/or surprising nearest neighbours.

**Handin:** find an interesting cluster in the t-SNE plot.

**Optional, for enthusiastic students:** try manually retrieving two word vectors using the indexing operator as described in gensim's documentation, then compute their cosine similarity (recall it is defined as $\mathrm{sim}(x,y) = \frac{\langle x, y \rangle}{\|x\|\,\|y\|}$; the corresponding cosine distance is $1 - \mathrm{sim}(x,y)$). You may be interested in `np.dot()` and `np.linalg.norm()`; see the numpy documentation for details. Compare this to the similarity computed by gensim's functions.
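For example, a minimal sketch of the manual computation might look like this (assuming the trained notebook model `model_ted` and the pre-4.0 gensim indexing API used in this practical; the word pair is arbitrary):

```python
import numpy as np

# Retrieve two word vectors with gensim's indexing operator.
x = model_ted["man"]
y = model_ted["woman"]

# Cosine similarity: <x, y> / (||x|| * ||y||).
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)

# gensim computes the same quantity; the two values should agree closely.
print(model_ted.similarity("man", "woman"))
```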
## Comparison to vectors trained on WikiText-2 data

We have provided downloading/preprocessing code (similar to the previous code) for the WikiText-2 dataset. The code uses a random subsample of the data so that it is comparable in size to the TED Talk data. Repeat the same analysis as above, but on this dataset.

**Handin:** find a few words with similar nearest neighbours.

**Handin:** find an interesting cluster in the t-SNE plot.

**Handin:** Are there any notable differences between the embeddings learned on the WikiText-2 data compared to those learned on the TED Talk data?

## (Optional, for enthusiastic students) Clustering

If you have extra time, try performing a k-means clustering (e.g. using `sklearn.cluster.KMeans`) on the embeddings, tuning the number of clusters until you get interesting or meaningful clusters.
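A minimal sketch of one way to do this (assuming the trained model `model_ted` and the top-1000 word list `words_top_ted` from the notebook; the cluster count of 30 is just a starting point to tune):

```python
from sklearn.cluster import KMeans

# Stack the vectors for the most frequent words into a (1000, 100) matrix
# (pre-4.0 gensim indexing syntax, as used in the notebook).
vecs = model_ted[words_top_ted]

# Cluster the embeddings; tune n_clusters by inspecting the output.
km = KMeans(n_clusters=30, random_state=0).fit(vecs)

# Show a few member words per cluster.
for c in range(km.n_clusters):
    members = [w for w, label in zip(words_top_ted, km.labels_) if label == c]
    print(c, members[:10])
```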
# Handin

See the bolded "**Handin:**" parts above. On paper or verbally, show a practical demonstrator your response to these to get signed off.

================================================
FILE: install-python.sh
================================================
#!/bin/bash

##############################
# check if installation is already done in user's dir
USER_DIR=/home/scratch/$USER/anaconda3
if [ -d "$USER_DIR" ] ; then
    echo You already have a directory called $USER_DIR
    echo You have either already installed it, or you can remove this
    echo directory and rerun this script if you are trying to reinstall.
    exit 1
fi

##############################
# install python
echo Installing python 3.5

# Go to tmp
cd /tmp/

# Download Anaconda python 3.5
wget --quiet https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
bash Anaconda3-4.2.0-Linux-x86_64.sh -b -p "$USER_DIR"
rm Anaconda3-4.2.0-Linux-x86_64.sh

# Add anaconda3 path
printf "export PATH='%s/bin':\$PATH\n" "$USER_DIR" >> ~/.bashrc
source ~/.bashrc

# Update anaconda
conda update -y conda

# Install word2vec dependencies
pip install gensim

echo
echo Anaconda python 3.5 has been installed
echo

================================================
FILE: practical1.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Practical 1: word2vec\n",
    "\n",
    "Oxford CS - Deep NLP 2017\n",
    "https://www.cs.ox.ac.uk/teaching/courses/2016-2017/dl/\n",
    "\n",
    "[Yannis Assael, Brendan Shillingford, Chris Dyer]"
   ]
  },
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This practical is presented as an IPython Notebook, with the code written for recent versions of **Python 3**. The code in this practical will not work with Python 2 unless you modify it. If you are using your own Python installation, ensure you have a setup identical to that described in the installation shell script (which is intended for use with the department lab machines). We will be unable to support installation on personal machines due to time constraints, so please use the lab machines and the setup script if you are unfamiliar with how to install Anaconda.\n", "\n", "To execute a notebook cell, press `shift-enter`. The return value of the last command will be displayed, if it is not `None`.\n", "\n", "Potentially useful library documentation, references, and resources:\n", "\n", "* IPython notebooks: \n", "* Numpy numerical array library: \n", "* Gensim's word2vec: \n", "* Bokeh interactive plots: (we provide plotting code here, but click the thumbnails for more examples to copy-paste)\n", "* scikit-learn ML library (aka `sklearn`): \n", "* nltk NLP toolkit: \n", "* tutorial for processing xml in python using `lxml`: (we did this for you below, but in case you need it in the future)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import os\n", "from random import shuffle\n", "import re" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bokeh.models import ColumnDataSource, LabelSet\n", "from bokeh.plotting import figure, show, output_file\n", "from bokeh.io import output_notebook\n", "output_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 0: Download the TED dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "import zipfile\n", "import lxml.etree" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download the dataset if it's not already there: this may take a minute as it is 75MB\n", "if not os.path.isfile('ted_en-20160408.xml'):\n", " urllib.request.urlretrieve(\"https://github.com/oxford-cs-deepnlp-2017/practical-1/blob/master/ted_en-20160408.xml?raw=true\", filename=\"ted_en-20160408.xml\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For now, we're only interested in the subtitle text, so let's extract that from the XML:\n", "doc = lxml.etree.parse('ted_en-20160408.xml')\n", "input_text = '\\n'.join(doc.xpath('//content/text()'))\n", "del doc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1: Preprocessing\n", "\n", "In this part, we attempt to clean up the raw subtitles a bit, so that we get only sentences. The following substring shows examples of what we're trying to get rid of. Since it's hard to define precisely what we want to get rid of, we'll just use some simple heuristics." 
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "i = input_text.find(\"Hyowon Gweon: See this?\")\n",
    "input_text[i-20:i+150]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by removing all parenthesized strings using a regex:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_text_noparens = re.sub(r'\\([^)]*\\)', '', input_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can verify the same location in the text is now clean as follows. We won't worry about the irregular spaces since we'll later split the text into sentences and tokenize it anyway."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "i = input_text_noparens.find(\"Hyowon Gweon: See this?\")\n",
    "input_text_noparens[i-20:i+150]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's attempt to remove speakers' names that occur at the beginning of a line, by deleting pieces of the form \"`speaker name:`\", as shown in this example. Of course, this is an imperfect heuristic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences_strings_ted = []\n",
    "for line in input_text_noparens.split('\\n'):\n",
    "    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)\n",
    "    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)\n",
    "\n",
    "# Uncomment if you need to save some RAM: these strings are about 50MB.\n",
    "# del input_text, input_text_noparens\n",
    "\n",
    "# Let's view the first few:\n",
    "sentences_strings_ted[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have sentences, we're ready to tokenize each of them into words. This tokenization is imperfect, of course. For instance, how many tokens is \"can't\", and where/how do we split it? We'll take the simplest naive approach of splitting on spaces. Before splitting, we remove non-alphanumeric characters, such as punctuation. You may want to consider the following question: why do we replace these characters with spaces rather than deleting them? Think of a case where this yields a different answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences_ted = []\n",
    "for sent_str in sentences_strings_ted:\n",
    "    tokens = re.sub(r\"[^a-z0-9]+\", \" \", sent_str.lower()).split()\n",
    "    sentences_ted.append(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of sentences, followed by two sample processed sentences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(sentences_ted)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sentences_ted[0])\n",
    "print(sentences_ted[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 2: Word Frequencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you store the counts of the top 1000 words in a list called `counts_ted_top1000`, the code below will plot the histogram requested in the writeup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ..."
   ]
  },
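  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible way to fill in the cell above (a sketch using `collections.Counter`, one of the options suggested in the writeup; it defines `words_top_ted` and `counts_ted_top1000`, the names the later plotting and t-SNE cells expect):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Count every token across all processed TED sentences.\n",
    "counter_ted = Counter(token for sent in sentences_ted for token in sent)\n",
    "\n",
    "# Top 40 words, as requested in the writeup.\n",
    "print(counter_ted.most_common(40))\n",
    "\n",
    "# Top-1000 words and counts, reused by the histogram and t-SNE cells below.\n",
    "words_top_ted = [w for w, c in counter_ted.most_common(1000)]\n",
    "counts_ted_top1000 = [c for w, c in counter_ted.most_common(1000)]"
   ]
  },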
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the distribution of the top-1000 word counts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist, edges = np.histogram(counts_ted_top1000, density=True, bins=100)\n",
    "\n",
    "p = figure(tools=\"pan,wheel_zoom,reset,save\",\n",
    "           toolbar_location=\"above\",\n",
    "           title=\"Top-1000 words distribution\")\n",
    "p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color=\"#555555\")\n",
    "show(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 3: Train Word2Vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models import Word2Vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ..."
   ]
  },
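  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible way to fill in the training cell above (a sketch using the pre-4.0 gensim API from this practical's era, where the dimensionality argument is called `size`; the settings follow the writeup: 100-dimensional embeddings, default CBOW, `min_count=10`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train word2vec on the tokenized TED sentences; CBOW is the default.\n",
    "# (Pre-4.0 gensim API: the dimensionality argument is `size`.)\n",
    "model_ted = Word2Vec(sentences_ted, size=100, min_count=10)\n",
    "\n",
    "# Vocabulary size; the writeup says this should be around 14427.\n",
    "len(model_ted.vocab)"
   ]
  },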
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 4: TED Learnt Representations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finding similar words (see gensim docs for more functionality of `most_similar`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_ted.most_similar(\"man\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_ted.most_similar(\"computer\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### t-SNE visualization\n",
    "To use the t-SNE code below, first put a list of the top 1000 words (as strings) into a variable `words_top_ted`. The following code gets the corresponding vectors from the model, assuming it's called `model_ted`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This assumes words_top_ted is a list of strings, the top 1000 words\n",
    "words_top_vec_ted = model_ted[words_top_ted]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.manifold import TSNE\n",
    "tsne = TSNE(n_components=2, random_state=0)\n",
    "words_top_ted_tsne = tsne.fit_transform(words_top_vec_ted)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = figure(tools=\"pan,wheel_zoom,reset,save\",\n",
    "           toolbar_location=\"above\",\n",
    "           title=\"word2vec T-SNE for most common words\")\n",
    "\n",
    "source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],\n",
    "                                    x2=words_top_ted_tsne[:,1],\n",
    "                                    names=words_top_ted))\n",
    "\n",
    "p.scatter(x=\"x1\", y=\"x2\", size=8, source=source)\n",
    "\n",
    "labels = LabelSet(x=\"x1\", y=\"x2\", text=\"names\", y_offset=6,\n",
    "                  text_font_size=\"8pt\", text_color=\"#555555\",\n",
    "                  source=source, text_align='center')\n",
    "p.add_layout(labels)\n",
    "\n",
    "show(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 5: Wiki Learnt Representations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.isfile('wikitext-103-raw-v1.zip'):\n",
    "    urllib.request.urlretrieve(\"https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip\", filename=\"wikitext-103-raw-v1.zip\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with zipfile.ZipFile('wikitext-103-raw-v1.zip', 'r') as z:\n",
    "    input_text = str(z.open('wikitext-103-raw/wiki.train.raw', 'r').read(), encoding='utf-8')  # Thanks Robert Bastian"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Preprocess the sentences (note that it's important to remove very short sentences, for performance):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences_wiki = []\n",
    "for line in input_text.split('\\n'):\n",
    "    s = [x for x in line.split('.') if x and len(x.split()) >= 5]\n",
    "    sentences_wiki.extend(s)\n",
    "\n",
    "for s_i in range(len(sentences_wiki)):\n",
    "    # Remove parenthesized strings first, then strip non-alphabetic characters;\n",
    "    # in the reverse order, the parentheses would already have been replaced by\n",
    "    # spaces, so the parenthesized contents could never be removed.\n",
    "    sentences_wiki[s_i] = re.sub(r'\\([^)]*\\)', '', sentences_wiki[s_i])\n",
    "    sentences_wiki[s_i] = re.sub(\"[^a-z]\", \" \", sentences_wiki[s_i].lower())\n",
    "del input_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sample 1/5 of the data\n",
    "shuffle(sentences_wiki)\n",
    "print(len(sentences_wiki))\n",
    "sentences_wiki = sentences_wiki[:int(len(sentences_wiki)/5)]\n",
    "print(len(sentences_wiki))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, repeat all the same steps that you performed above. You should be able to reuse essentially all the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### t-SNE visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This assumes words_top_wiki is a list of strings, the top 1000 words\n",
    "words_top_vec_wiki = model_wiki[words_top_wiki]\n",
    "\n",
    "tsne = TSNE(n_components=2, random_state=0)\n",
    "words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = figure(tools=\"pan,wheel_zoom,reset,save\",\n",
    "           toolbar_location=\"above\",\n",
    "           title=\"word2vec T-SNE for most common words\")\n",
    "\n",
    "source = ColumnDataSource(data=dict(x1=words_top_wiki_tsne[:,0],\n",
    "                                    x2=words_top_wiki_tsne[:,1],\n",
    "                                    names=words_top_wiki))\n",
    "\n",
    "p.scatter(x=\"x1\", y=\"x2\", size=8, source=source)\n",
    "\n",
    "labels = LabelSet(x=\"x1\", y=\"x2\", text=\"names\", y_offset=6,\n",
    "                  text_font_size=\"8pt\", text_color=\"#555555\",\n",
    "                  source=source, text_align='center')\n",
    "p.add_layout(labels)\n",
    "\n",
    "show(p)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}

================================================
FILE: ted_en-20160408.xml
================================================
[File too large to display: 71.1 MB]