Full Code of amanchadha/stanford-cs224n-assignments-2021 for AI

Repository: amanchadha/stanford-cs224n-assignments-2021
Branch: main
Commit: 20ede2ff6b41
Files: 183
Total size: 210.6 MB

Directory structure:
gitextract_w5crukyc/

├── README.md
├── a1/
│   ├── .ipynb_checkpoints/
│   │   └── exploring_word_vectors-checkpoint.ipynb
│   ├── Gensim word vector visualization.ipynb
│   ├── README.txt
│   ├── env.yml
│   ├── exploring_word_vectors.ipynb
│   └── exploring_word_vectors_solved.ipynb
├── a2/
│   ├── collect_submission.sh
│   ├── env.yml
│   ├── get_datasets.sh
│   ├── run.py
│   ├── saved_params_10000.npy
│   ├── saved_params_15000.npy
│   ├── saved_params_20000.npy
│   ├── saved_params_25000.npy
│   ├── saved_params_30000.npy
│   ├── saved_params_35000.npy
│   ├── saved_params_40000.npy
│   ├── saved_params_5000.npy
│   ├── saved_state_10000.pickle
│   ├── saved_state_15000.pickle
│   ├── saved_state_20000.pickle
│   ├── saved_state_25000.pickle
│   ├── saved_state_30000.pickle
│   ├── saved_state_35000.pickle
│   ├── saved_state_40000.pickle
│   ├── saved_state_5000.pickle
│   ├── sgd.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── datasets/
│   │   │   └── stanfordSentimentTreebank/
│   │   │       ├── README.txt
│   │   │       ├── SOStr.txt
│   │   │       ├── STree.txt
│   │   │       ├── datasetSentences.txt
│   │   │       ├── datasetSplit.txt
│   │   │       ├── dictionary.txt
│   │   │       ├── original_rt_snippets.txt
│   │   │       └── sentiment_labels.txt
│   │   ├── gradcheck.py
│   │   ├── treebank.py
│   │   └── utils.py
│   └── word2vec.py
├── a3/
│   ├── README.txt
│   ├── collect_submission.sh
│   ├── data/
│   │   ├── dev.conll
│   │   ├── dev.gold.conll
│   │   ├── en-cw.txt
│   │   ├── test.conll
│   │   ├── test.gold.conll
│   │   ├── train.conll
│   │   └── train.gold.conll
│   ├── local_env.yml
│   ├── parser_model.py
│   ├── parser_transitions.py
│   ├── results/
│   │   └── 20210202_021220/
│   │       └── model.weights
│   ├── run.py
│   └── utils/
│       ├── __init__.py
│       ├── general_utils.py
│       └── parser_utils.py
├── a4/
│   ├── README.md
│   ├── __init__.py
│   ├── chr_en_data/
│   │   ├── dev.chr
│   │   ├── dev.en
│   │   ├── test.chr
│   │   ├── test.en
│   │   ├── train.chr
│   │   └── train.en
│   ├── collect_submission.bat
│   ├── collect_submission.sh
│   ├── gpu_requirements.txt
│   ├── local_env.yml
│   ├── model_embeddings.py
│   ├── nmt_model.py
│   ├── outputs/
│   │   ├── .gitignore
│   │   └── test_outputs.txt
│   ├── run.bat
│   ├── run.py
│   ├── run.sh
│   ├── sanity_check.py
│   ├── sanity_check_en_es_data/
│   │   ├── Ybar_t.pkl
│   │   ├── combined_outputs.pkl
│   │   ├── dec_init_state.pkl
│   │   ├── dec_state.pkl
│   │   ├── dev_sanity_check.en
│   │   ├── dev_sanity_check.es
│   │   ├── e_t.pkl
│   │   ├── enc_hiddens.pkl
│   │   ├── enc_hiddens_proj.pkl
│   │   ├── enc_masks.pkl
│   │   ├── o_t.pkl
│   │   ├── step_dec_state_0.pkl
│   │   ├── step_dec_state_1.pkl
│   │   ├── step_dec_state_10.pkl
│   │   ├── step_dec_state_11.pkl
│   │   ├── step_dec_state_12.pkl
│   │   ├── step_dec_state_13.pkl
│   │   ├── step_dec_state_14.pkl
│   │   ├── step_dec_state_15.pkl
│   │   ├── step_dec_state_16.pkl
│   │   ├── step_dec_state_17.pkl
│   │   ├── step_dec_state_18.pkl
│   │   ├── step_dec_state_19.pkl
│   │   ├── step_dec_state_2.pkl
│   │   ├── step_dec_state_20.pkl
│   │   ├── step_dec_state_21.pkl
│   │   ├── step_dec_state_22.pkl
│   │   ├── step_dec_state_3.pkl
│   │   ├── step_dec_state_4.pkl
│   │   ├── step_dec_state_5.pkl
│   │   ├── step_dec_state_6.pkl
│   │   ├── step_dec_state_7.pkl
│   │   ├── step_dec_state_8.pkl
│   │   ├── step_dec_state_9.pkl
│   │   ├── step_o_t_0.pkl
│   │   ├── step_o_t_1.pkl
│   │   ├── step_o_t_10.pkl
│   │   ├── step_o_t_11.pkl
│   │   ├── step_o_t_12.pkl
│   │   ├── step_o_t_13.pkl
│   │   ├── step_o_t_14.pkl
│   │   ├── step_o_t_15.pkl
│   │   ├── step_o_t_16.pkl
│   │   ├── step_o_t_17.pkl
│   │   ├── step_o_t_18.pkl
│   │   ├── step_o_t_19.pkl
│   │   ├── step_o_t_2.pkl
│   │   ├── step_o_t_20.pkl
│   │   ├── step_o_t_21.pkl
│   │   ├── step_o_t_22.pkl
│   │   ├── step_o_t_3.pkl
│   │   ├── step_o_t_4.pkl
│   │   ├── step_o_t_5.pkl
│   │   ├── step_o_t_6.pkl
│   │   ├── step_o_t_7.pkl
│   │   ├── step_o_t_8.pkl
│   │   ├── step_o_t_9.pkl
│   │   ├── target_padded.pkl
│   │   ├── test_sanity_check.en
│   │   ├── test_sanity_check.es
│   │   ├── train_sanity_check.en
│   │   ├── train_sanity_check.es
│   │   └── vocab_sanity_check.json
│   ├── src.model
│   ├── src.vocab
│   ├── test_outputs.txt
│   ├── tgt.model
│   ├── tgt.vocab
│   ├── utils.py
│   ├── vocab.json
│   └── vocab.py
└── a5/
    ├── birth_dev.tsv
    ├── birth_places_train.tsv
    ├── birth_test_inputs.tsv
    ├── collect_submission.sh
    ├── d_cmd
    ├── f_cmd
    ├── g_cmd
    ├── mingpt-demo/
    │   ├── .ipynb_checkpoints/
    │   │   └── play_char-checkpoint.ipynb
    │   ├── LICENSE
    │   ├── README.md
    │   ├── mingpt/
    │   │   ├── __init__.py
    │   │   ├── model.py
    │   │   ├── trainer.py
    │   │   └── utils.py
    │   └── play_char.ipynb
    ├── src/
    │   ├── attention.py
    │   ├── dataset.py
    │   ├── london_baseline.py
    │   ├── model.py
    │   ├── run.py
    │   ├── trainer.py
    │   └── utils.py
    ├── synthesizer.finetune.params
    ├── synthesizer.pretrain.dev.predictions
    ├── synthesizer.pretrain.params
    ├── synthesizer.pretrain.test.predictions
    ├── vanilla.finetune.params
    ├── vanilla.model.params
    ├── vanilla.nopretrain.dev.predictions
    ├── vanilla.nopretrain.test.predictions
    ├── vanilla.pretrain.dev.predictions
    ├── vanilla.pretrain.params
    ├── vanilla.pretrain.test.predictions
    └── wiki.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Stanford Course CS224n - Natural Language Processing with Deep Learning (Winter 2021)

These are my solutions to the assignments of [CS224n (Natural Language Processing with Deep Learning)](http://web.stanford.edu/class/cs224n/) offered by Stanford University in Winter 2021.
There are five assignments in total. Here is a brief description of each one of these assignments:

## Assignment 1. Word Embeddings

- This assignment [[notebook](a1/exploring_word_vectors_solved.ipynb), [PDF](a1/CS224nAssignment1.pdf)] has two parts which deal with representing words with dense vectors (i.e., word vectors or word embeddings). Word vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from co-occurrence matrices (which use SVD), and those derived via GloVe (based on maximum-likelihood training in ML).

### 1. Count-Based Word Vectors

- Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many "old school" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, co-occurrence matrices.
- In this part, you will use co-occurrence matrices to develop dense vectors for words. A co-occurrence matrix counts how often different terms co-occur in different documents. To derive a co-occurrence matrix, we use a window with a fixed size <i>w</i>, and then slide this window over all of the documents. Then, we count how many times two different words <i>v<sub>i</sub></i> and <i>v<sub>j</sub></i> occur together in a window, and put this number in the <i>(i, j)<sup>th</sup></i> entry of the matrix. We then run dimensionality reduction on the co-occurrence matrix using singular value decomposition, select the top <i>r</i> components after the decomposition, and thus derive <i>r</i>-dimensional embeddings for words.
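
The counting and reduction steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the assignment solution: the helper names `build_cooccurrence` and `reduce_svd`, the two toy documents, and the choices of window size 1 and <i>r</i> = 2 are all assumptions made for the example.

```python
import numpy as np

def build_cooccurrence(docs, window=1):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        for i, center in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the center position itself
                    M[index[center], index[doc[j]]] += 1
    return M, index

def reduce_svd(M, r=2):
    """Keep the top-r singular directions; rows of U_r * S_r are r-dim embeddings."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :r] * S[:r]

# Hypothetical toy corpus, already tokenized.
docs = [
    "all that glitters is not gold".split(),
    "all is well that ends well".split(),
]
M, index = build_cooccurrence(docs, window=1)
embeddings = reduce_svd(M, r=2)  # one 2-d row per vocabulary word
```

Scaling the left singular vectors by the singular values is one common convention for the reduced vectors; for large vocabularies a truncated solver is usually preferable to a full decomposition.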

<p align="center">
<img src="figures/svd.jpg" alt="drawing" width="400"/>
</p>

### 2. Prediction (or Maximum Likelihood)-Based Word Vectors: Word2Vec

- Prediction-based word vectors (which also utilize the benefit of counts) such as Word2Vec and GloVe have demonstrated better performance compared to co-occurrence matrices. In this part, you will explore pretrained GloVe embeddings using the [Gensim](https://radimrehurek.com/gensim/) package. Initially, you'll have to reduce the dimensionality of word vectors using SVD from 300 to 2 to be able to visualize and analyze these vectors. Next, you'll find the closest word vectors to a given word vector. You'll then get to know words with multiple meanings (polysemous words). You will also experiment with the analogy task, introduced for the first time in the Word2Vec paper [(Mikolov et al. 2013)](https://arxiv.org/pdf/1301.3781.pdf). The task is simple: given words <i>x, y</i>, and <i>z</i>, you have to find a word <i>w</i> such that the following relationship holds: <i>x</i> is to <i>y</i> like <i>z</i> is to <i>w</i>. For example, Rome is to Italy like D.C. is to the United States. You will find that solving this task with Word2Vec vectors is easy: it takes only a simple addition and subtraction of vectors, which is a nice feature of word embeddings.
- If you’re feeling adventurous, challenge yourself and try reading [GloVe’s original paper](https://nlp.stanford.edu/pubs/glove.pdf).
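
The analogy search described above amounts to finding the word whose vector is closest to <i>y</i> - <i>x</i> + <i>z</i>. The sketch below uses hand-picked, hypothetical 2-d vectors so that the arithmetic is easy to follow; real embeddings are learned and have hundreds of dimensions. The `analogy` helper and the use of cosine similarity are assumptions for illustration.

```python
import numpy as np

# Hypothetical vectors chosen by hand: axis 0 separates countries' regions,
# axis 1 separates capitals (0) from countries (1).
vectors = {
    "rome":   np.array([1.0, 0.0]),
    "italy":  np.array([1.0, 1.0]),
    "paris":  np.array([3.0, 0.0]),
    "france": np.array([3.0, 1.0]),
    "tokyo":  np.array([-2.0, 0.0]),
    "japan":  np.array([-2.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(x, y, z, vectors):
    """Return the word w maximizing cosine(v_w, v_y - v_x + v_z).

    The three query words are excluded from the candidates.
    """
    target = vectors[y] - vectors[x] + vectors[z]
    candidates = [w for w in vectors if w not in (x, y, z)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("rome", "italy", "paris", vectors))  # prints "france"
```

With real pretrained vectors loaded through Gensim, `KeyedVectors.most_similar(positive=[...], negative=[...])` performs a comparable search.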

<p align="center">
<img src="figures/analogy.jpg" alt="drawing" width="400"/>
</p>

## Assignment 2. Understanding and Implementing Word2Vec

- In this [assignment](a2/CS224nAssignment2.pdf), you will get familiar with the Word2Vec algorithm. The key insight behind Word2Vec is that "a word is known by the company it keeps". There are two models introduced by the Word2Vec paper based on this idea: (i) the skip-gram model and (ii) the Continuous Bag Of Words (CBOW) model. 
- In this assignment, you'll be implementing the skip-gram model using NumPy. You have to implement both versions of skip-gram: the first with the naive softmax loss and the second, which is much faster, with the negative sampling loss. Your implementation of the first version is only sanity-checked on a small dataset, but you have to run the second version on the Stanford Sentiment Treebank dataset.
- It is highly recommended that anyone interested in gaining a deep understanding of Word2Vec first do the theoretical part of this assignment and only then proceed to the practical part.
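
The two losses mentioned above can be written compactly for a single (center, outside) word pair. The sketch below is not the repository's `word2vec.py`: the function names, argument order, and random toy parameters are assumptions, and gradients are omitted. It is meant only to show why negative sampling is cheaper: the naive softmax normalizes over every row of `U`, while negative sampling touches only the true outside word and K sampled rows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def naive_softmax_loss(v_c, o, U):
    """-log P(o | c), normalizing over every word in the vocabulary."""
    scores = U @ v_c
    scores = scores - scores.max()  # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[o]

def neg_sampling_loss(v_c, o, U, neg_idx):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c) over sampled negatives."""
    pos = np.log(sigmoid(U[o] @ v_c))
    neg = np.log(sigmoid(-(U[neg_idx] @ v_c))).sum()
    return -(pos + neg)

# Hypothetical toy setup: vocabulary of 10 words, 4-dimensional vectors.
rng = np.random.default_rng(0)
V, d = 10, 4
U = rng.normal(size=(V, d))   # outside-word vectors, one row per word
v_c = rng.normal(size=d)      # center-word vector
loss_softmax = naive_softmax_loss(v_c, 3, U)
loss_neg = neg_sampling_loss(v_c, 3, U, neg_idx=[1, 5, 7])
```

In practice the negative indices would be drawn from a noise distribution over the vocabulary rather than fixed by hand as they are here.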

<p align="center">
<img src="figures/word2vec.jpg" alt="drawing" width="450"/>
</p>

## Assignment 3. Neural Dependency Parsing

- If you have taken a compiler course before, you have definitely heard the term "parsing". This assignment is about "dependency parsing" where you train a model that can specify the dependencies. If you remember "Shift-Reduce Parser" from your Compiler class, then you will find the ideas here quite familiar. The only difference is that we shall use a neural network to find the dependencies. In this assignment, you will build a neural dependency parser using PyTorch.
- In Part 1 of this [assignment](a3/CS224nAssignment3.pdf), you will learn about two general neural network techniques (Adam Optimization and Dropout). In Part 2 of this assignment, you will implement and train a dependency parser using the techniques from Part 1, before analyzing a few erroneous dependency parses. 
- Both the Adam optimizer and Dropout will be used in the neural dependency parser you are going to implement with PyTorch. The parser will do one of the following three moves: 1) Shift 2) Left-arc 3) Right-arc. You can read more about the details of these three moves in the [handout](a3/a3.pdf) of the assignment. What your network should do is to predict one of these moves at every step. For predicting each move, your model needs features which are going to be extracted from the stack and buffer of each stage (you maintain a stack and buffer during parsing, to know what you have already parsed and what is remaining for parsing, respectively). The good news is that the code for extracting features is given to you to help you just focus on the neural network part! There are lots of hints throughout the assignment -- as this is the first assignment in the course where students work with PyTorch -- that walk you through implementing each part. 
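
The three moves can be illustrated without any neural network by applying a hand-written sequence of transitions to a stack and a buffer. This is a simplified sketch, not the repository's `parser_transitions.py`: the function name, the move labels `"S"`, `"LA"`, `"RA"`, and the example sentence are assumptions. In the assignment, the move at each step is predicted by the trained model instead of being hard-coded.

```python
def apply_transition(stack, buffer, arcs, move):
    """Apply one move in place. Arcs are stored as (head, dependent) pairs."""
    if move == "S":        # Shift: move the first buffer word onto the stack
        stack.append(buffer.pop(0))
    elif move == "LA":     # Left-arc: top of stack becomes head of the second item
        dependent = stack.pop(-2)
        arcs.append((stack[-1], dependent))
    elif move == "RA":     # Right-arc: second item becomes head of the top of stack
        dependent = stack.pop()
        arcs.append((stack[-1], dependent))
    else:
        raise ValueError(f"unknown move: {move}")

# Hypothetical sentence and gold transition sequence.
stack, buffer, arcs = ["ROOT"], ["I", "parsed", "this", "sentence"], []
for move in ["S", "S", "LA", "S", "S", "LA", "RA", "RA"]:
    apply_transition(stack, buffer, arcs, move)

print(arcs)
```

Parsing is complete when the buffer is empty and only `ROOT` remains on the stack.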

<p align="center">
<img src="figures/dependency-parsing.jpg" alt="drawing" width="450"/>
</p>

## Assignment 4. Seq2Seq Machine Translation Model with Multiplicative Attention

- This [assignment](a4/CS224nAssignment4.pdf) is split into two sections: Neural Machine Translation with RNNs and Analyzing NMT Systems. The first is primarily coding and implementation focused, whereas the second entirely consists of written, analysis questions.

<p align="center">
<img src="figures/nmt.jpg" alt="drawing" width="350"/>
</p>

## Assignment 5. Self-Attention, Transformers, and Pretraining

- This [assignment](a5/CS224nAssignment5.pdf) is an investigation into Transformer self-attention building blocks, and the effects of pretraining. It covers mathematical properties of Transformers and self-attention through written questions. Further, you’ll get experience with practical system-building through repurposing an existing codebase. The assignment is split into a written (mathematical) part and a coding part, with its own written questions. Here’s a quick summary:
  1. Mathematical exploration: What kinds of operations can self-attention easily implement? Why should we use fancier things like multi-headed self-attention? This section will use some mathematical investigations to illuminate a few of the motivations of self-attention and Transformer networks.
  2. Extending a research codebase: In this portion of the assignment, you’ll get some experience and intuition for a cutting-edge research topic in NLP: teaching NLP models facts about the world through pretraining, and accessing that knowledge through finetuning. You’ll train a Transformer model to attempt to answer simple questions of the form “Where was person [x] born?” – without providing any input text from which to draw the answer. You’ll find that models are able to learn some facts about where people were born through pretraining, and access that information during fine-tuning to answer the questions.
- Then, you’ll take a harder look at the system you built, and reason about the implications and concerns about relying on such implicit pretrained knowledge.
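
As a reference point for the mathematical questions, here is a minimal single-head scaled dot-product self-attention in NumPy. It is a sketch under stated assumptions, not the code in `a5/src/attention.py`: the function names, the random projection matrices, and the optional causal mask are chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if causal:
        # Mask future positions so token t only attends to tokens <= t.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Hypothetical toy input: 5 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v, causal=True)
```

Each output row is a weighted average of the value vectors, which is why operations such as copying or averaging earlier tokens are easy for a single head to express.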

<p align="center">
<img src="figures/attn.jpg"/>
</p>

## Handouts

 - [x] Assignment 1: Intro to word vectors [`a1/exploring_word_vectors.ipynb`](a1/exploring_word_vectors.ipynb)
 - [x] Assignment 2: Training Word2Vec [`a2/a2.pdf`](a2/a2.pdf)
 - [x] Assignment 3: Dependency parsing [`a3/a3.pdf`](a3/a3.pdf)
 - [x] Assignment 4: Neural machine translation with seq2seq and attention [`a4/a4.pdf`](a4/a4.pdf)
 - [x] Assignment 5: Self-attention, Transformers, and pretraining [`a5/a5.pdf`](a5/a5.pdf)
 - [x] Project: [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) - Stanford Question Answering Dataset [`project/proposal.pdf`](project/proposal.pdf)

## Prerequisites

- Proficiency in Python.
  All class assignments will be in Python (using NumPy and PyTorch). If you need to remind yourself of Python, or you're not very familiar with NumPy, attend the Python review session in week 1 (listed in the schedule). If you have a lot of programming experience but in a different language (e.g. C/C++/Matlab/Java/Javascript), you will probably be fine.
- College Calculus, Linear Algebra (e.g. MATH51, CME100).
  You should be comfortable taking (multivariable) derivatives and understanding matrix/vector notation and operations.
- Basic Probability and Statistics (e.g. CS109 or equivalent)
  You should know basics of probabilities, Gaussian distributions, mean, standard deviation, etc.
- Foundations of Machine Learning (e.g. CS221 or CS229).
  We will be formulating cost functions, taking derivatives and performing optimization with gradient descent. If you already have basic machine learning and/or deep learning knowledge, the course will be easier; however it is possible to take CS224n without it. There are many introductions to ML, in webpage, book, and video form. One approachable introduction is Hal Daumé's in-progress "A Course in Machine Learning". Reading the first 5 chapters of that book would be good background. Knowing the first 7 chapters would be even better!

## Course Description

- This course was formed in 2017 as a merger of the earlier [CS224n](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1162) (Natural Language Processing) and [CS224d](http://cs224d.stanford.edu/) (Natural Language Processing with Deep Learning) courses. Below you can find archived websites and student project reports.
- Natural language processing (NLP) is one of the most important technologies of the information age, and a crucial part of artificial intelligence. Applications of NLP are everywhere because people communicate almost everything in language: web search, advertising, emails, customer service, language translation, medical reports, etc. In recent years, Deep Learning approaches have obtained very high performance across many different NLP tasks, using single end-to-end neural models that do not require traditional, task-specific feature engineering. In this course, students will gain a thorough introduction to cutting-edge research in Deep Learning for NLP. Through lectures, assignments and a final project, students will learn the necessary skills to design, implement, and understand their own neural network models. CS224n uses PyTorch.

## Resources

Lecture notes, assignments and other materials can be downloaded from the [course webpage](http://web.stanford.edu/class/cs224n/).
- Lectures: [YouTube](https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z)
- Schedule: [Stanford CS224n](http://web.stanford.edu/class/cs224n/index.html#schedule)
- Projects: [Stanford CS224n](http://web.stanford.edu/class/cs224n/project.html)

## Disclaimer

I recognize the time and effort people spend on building intuition, understanding new concepts and debugging assignments. The solutions uploaded here are **only for reference**. They are meant to unblock you if you get stuck somewhere. Please do not copy any part of the solutions as-is (the assignments are fairly easy if you read the instructions carefully).

================================================
FILE: a1/.ipynb_checkpoints/exploring_word_vectors-checkpoint.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CS224N Assignment 1: Exploring Word Vectors (25 Points)\n",
    "### <font color='blue'> Due 4:30pm, Tue Jan 19 </font>\n",
    "\n",
    "Welcome to CS224N! \n",
    "\n",
    "Before you start, make sure you read the README.txt in the same directory as this notebook for important setup information. A lot of code is provided in this notebook, and we highly encourage you to read and understand it as part of the learning :)\n",
    "\n",
    "If you aren't super familiar with Python, Numpy, or Matplotlib, we recommend you check out the review session on Friday. The session will be recorded and the material will be made available on our [website](http://web.stanford.edu/class/cs224n/index.html#schedule). The CS231N Python/Numpy [tutorial](https://cs231n.github.io/python-numpy-tutorial/) is also a great resource.\n",
    "\n",
    "\n",
    "**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package reuters to /Users/Aman/nltk_data...\n",
      "[nltk_data]   Package reuters is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "# All Import Statements Defined Here\n",
    "# Note: Do not add to this list.\n",
    "# ----------------\n",
    "\n",
    "import sys\n",
    "assert sys.version_info[0]==3\n",
    "assert sys.version_info[1] >= 5\n",
    "\n",
    "from gensim.models import KeyedVectors\n",
    "from gensim.test.utils import datapath\n",
    "import pprint\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['figure.figsize'] = [10, 5]\n",
    "import nltk\n",
    "nltk.download('reuters')\n",
    "from nltk.corpus import reuters\n",
    "import numpy as np\n",
    "import random\n",
    "import scipy as sp\n",
    "from sklearn.decomposition import TruncatedSVD\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "START_TOKEN = '<START>'\n",
    "END_TOKEN = '<END>'\n",
    "\n",
    "np.random.seed(0)\n",
    "random.seed(0)\n",
    "# ----------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word Vectors\n",
    "\n",
    "Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from *co-occurrence matrices*, and those derived via *GloVe*. \n",
    "\n",
    "**Note on Terminology:** The terms \"word vectors\" and \"word embeddings\" are often used interchangeably. The term \"embedding\" refers to the fact that we are encoding aspects of a word's meaning in a lower dimensional space. As [Wikipedia](https://en.wikipedia.org/wiki/Word_embedding) states, \"*conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension*\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Count-Based Word Vectors (10 points)\n",
    "\n",
    "Most word vector models start from the following idea:\n",
    "\n",
    "*You shall know a word by the company it keeps ([Firth, J. R. 1957:11](https://en.wikipedia.org/wiki/John_Rupert_Firth))*\n",
    "\n",
    "Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many \"old school\" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, *co-occurrence matrices* (for more information, see [here](http://web.stanford.edu/class/cs124/lec/vectorsemantics.video.pdf) or [here](https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Co-Occurrence\n",
    "\n",
    "A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the *context window* surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \\dots w_{i-1}$ and $w_{i+1} \\dots w_{i+n}$. We build a *co-occurrence matrix* $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window among all documents.\n",
    "\n",
    "**Example: Co-Occurrence with Fixed Window of n=1**:\n",
    "\n",
    "Document 1: \"all that glitters is not gold\"\n",
    "\n",
    "Document 2: \"all is well that ends well\"\n",
    "\n",
    "\n",
    "|     *    | `<START>` | all | that | glitters | is   | not  | gold  | well | ends | `<END>` |\n",
    "|----------|-------|-----|------|----------|------|------|-------|------|------|-----|\n",
    "| `<START>`    | 0     | 2   | 0    | 0        | 0    | 0    | 0     | 0    | 0    | 0   |\n",
    "| all      | 2     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 0    | 0   |\n",
    "| that     | 0     | 1   | 0    | 1        | 0    | 0    | 0     | 1    | 1    | 0   |\n",
    "| glitters | 0     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 0    | 0   |\n",
    "| is       | 0     | 1   | 0    | 1        | 0    | 1    | 0     | 1    | 0    | 0   |\n",
    "| not      | 0     | 0   | 0    | 0        | 1    | 0    | 1     | 0    | 0    | 0   |\n",
    "| gold     | 0     | 0   | 0    | 0        | 0    | 1    | 0     | 0    | 0    | 1   |\n",
    "| well     | 0     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 1    | 1   |\n",
    "| ends     | 0     | 0   | 1    | 0        | 0    | 0    | 0     | 1    | 0    | 0   |\n",
    "| `<END>`      | 0     | 0   | 0    | 0        | 0    | 0    | 1     | 1    | 0    | 0   |\n",
    "\n",
    "**Note:** In NLP, we often add `<START>` and `<END>` tokens to represent the beginning and end of sentences, paragraphs or documents. In this case we imagine `<START>` and `<END>` tokens encapsulating each document, e.g., \"`<START>` All that glitters is not gold `<END>`\", and include these tokens in our co-occurrence counts.\n",
    "\n",
    "The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run *dimensionality reduction*. In particular, we will run *SVD (Singular Value Decomposition)*, which is a kind of generalized *PCA (Principal Components Analysis)* to select the top $k$ principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. We obtain a full matrix decomposition, with the singular values ordered in the diagonal $S$ matrix, and our new, shorter length-$k$ word vectors in $U_k$.\n",
    "\n",
    "![Picture of an SVD](imgs/svd.png \"SVD\")\n",
    "\n",
    "This reduced-dimensionality co-occurrence representation preserves semantic relationships between words, e.g. *doctor* and *hospital* will be closer than *doctor* and *dog*. \n",
    "\n",
    "**Notes:** If you can barely remember what an eigenvalue is, here's [a slow, friendly introduction to SVD](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf). If you want to learn more thoroughly about PCA or SVD, feel free to check out lectures [7](https://web.stanford.edu/class/cs168/l/l7.pdf), [8](http://theory.stanford.edu/~tim/s15/l/l8.pdf), and [9](https://web.stanford.edu/class/cs168/l/l9.pdf) of CS168. These course notes provide a great high-level treatment of these general purpose algorithms. Though, for the purpose of this class, you only need to know how to extract the k-dimensional embeddings by utilizing pre-programmed implementations of these algorithms from the numpy, scipy, or sklearn python packages. In practice, it is challenging to apply full SVD to large corpora because of the memory needed to perform PCA or SVD. However, if you only want the top $k$ vector components for relatively small $k$ — known as [Truncated SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition#Truncated_SVD) — then there are reasonably scalable techniques to compute those iteratively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting Co-Occurrence Word Embeddings\n",
    "\n",
    "Here, we will be using the Reuters (business and financial news) corpus. If you haven't run the import cell at the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. These documents span 90 categories and are split into train and test. For more details, please see https://www.nltk.org/book/ch02.html. We provide a `read_corpus` function below that pulls out only articles from the \"crude\" (i.e. news articles about oil, gas, etc.) category. The function also adds `<START>` and `<END>` tokens to each of the documents, and lowercases words. You do **not** have to perform any other kind of pre-processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_corpus(category=\"crude\"):\n",
    "    \"\"\" Read files from the specified Reuters category.\n",
    "        Params:\n",
    "            category (string): category name\n",
    "        Return:\n",
    "            list of lists, with words from each of the processed files\n",
    "    \"\"\"\n",
    "    files = reuters.fileids(category)\n",
    "    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look at what these documents are like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['<START>', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',\n",
      "  'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',\n",
      "  'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',\n",
      "  'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',\n",
      "  'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',\n",
      "  'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',\n",
      "  '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',\n",
      "  'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',\n",
      "  'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',\n",
      "  'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',\n",
      "  'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',\n",
      "  'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',\n",
      "  'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also', 'review', 'the', 'breakdown',\n",
      "  'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',\n",
      "  'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', \"'\", 's',\n",
      "  'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',\n",
      "  'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',\n",
      "  'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',\n",
      "  'they', 'noted', '.', '<END>'],\n",
      " ['<START>', 'energy', '/', 'u', '.', 's', '.', 'petrochemical', 'industry', 'cheap', 'oil',\n",
      "  'feedstocks', ',', 'the', 'weakened', 'u', '.', 's', '.', 'dollar', 'and', 'a', 'plant',\n",
      "  'utilization', 'rate', 'approaching', '90', 'pct', 'will', 'propel', 'the', 'streamlined', 'u',\n",
      "  '.', 's', '.', 'petrochemical', 'industry', 'to', 'record', 'profits', 'this', 'year', ',',\n",
      "  'with', 'growth', 'expected', 'through', 'at', 'least', '1990', ',', 'major', 'company',\n",
      "  'executives', 'predicted', '.', 'this', 'bullish', 'outlook', 'for', 'chemical', 'manufacturing',\n",
      "  'and', 'an', 'industrywide', 'move', 'to', 'shed', 'unrelated', 'businesses', 'has', 'prompted',\n",
      "  'gaf', 'corp', '&', 'lt', ';', 'gaf', '>,', 'privately', '-', 'held', 'cain', 'chemical', 'inc',\n",
      "  ',', 'and', 'other', 'firms', 'to', 'aggressively', 'seek', 'acquisitions', 'of', 'petrochemical',\n",
      "  'plants', '.', 'oil', 'companies', 'such', 'as', 'ashland', 'oil', 'inc', '&', 'lt', ';', 'ash',\n",
      "  '>,', 'the', 'kentucky', '-', 'based', 'oil', 'refiner', 'and', 'marketer', ',', 'are', 'also',\n",
      "  'shopping', 'for', 'money', '-', 'making', 'petrochemical', 'businesses', 'to', 'buy', '.', '\"',\n",
      "  'i', 'see', 'us', 'poised', 'at', 'the', 'threshold', 'of', 'a', 'golden', 'period', ',\"', 'said',\n",
      "  'paul', 'oreffice', ',', 'chairman', 'of', 'giant', 'dow', 'chemical', 'co', '&', 'lt', ';',\n",
      "  'dow', '>,', 'adding', ',', '\"', 'there', \"'\", 's', 'no', 'major', 'plant', 'capacity', 'being',\n",
      "  'added', 'around', 'the', 'world', 'now', '.', 'the', 'whole', 'game', 'is', 'bringing', 'out',\n",
      "  'new', 'products', 'and', 'improving', 'the', 'old', 'ones', '.\"', 'analysts', 'say', 'the',\n",
      "  'chemical', 'industry', \"'\", 's', 'biggest', 'customers', ',', 'automobile', 'manufacturers',\n",
      "  'and', 'home', 'builders', 'that', 'use', 'a', 'lot', 'of', 'paints', 'and', 'plastics', ',',\n",
      "  'are', 'expected', 'to', 'buy', 'quantities', 'this', 'year', '.', 'u', '.', 's', '.',\n",
      "  'petrochemical', 'plants', 'are', 'currently', 'operating', 'at', 'about', '90', 'pct',\n",
      "  'capacity', ',', 'reflecting', 'tighter', 'supply', 'that', 'could', 'hike', 'product', 'prices',\n",
      "  'by', '30', 'to', '40', 'pct', 'this', 'year', ',', 'said', 'john', 'dosher', ',', 'managing',\n",
      "  'director', 'of', 'pace', 'consultants', 'inc', 'of', 'houston', '.', 'demand', 'for', 'some',\n",
      "  'products', 'such', 'as', 'styrene', 'could', 'push', 'profit', 'margins', 'up', 'by', 'as',\n",
      "  'much', 'as', '300', 'pct', ',', 'he', 'said', '.', 'oreffice', ',', 'speaking', 'at', 'a',\n",
      "  'meeting', 'of', 'chemical', 'engineers', 'in', 'houston', ',', 'said', 'dow', 'would', 'easily',\n",
      "  'top', 'the', '741', 'mln', 'dlrs', 'it', 'earned', 'last', 'year', 'and', 'predicted', 'it',\n",
      "  'would', 'have', 'the', 'best', 'year', 'in', 'its', 'history', '.', 'in', '1985', ',', 'when',\n",
      "  'oil', 'prices', 'were', 'still', 'above', '25', 'dlrs', 'a', 'barrel', 'and', 'chemical',\n",
      "  'exports', 'were', 'adversely', 'affected', 'by', 'the', 'strong', 'u', '.', 's', '.', 'dollar',\n",
      "  ',', 'dow', 'had', 'profits', 'of', '58', 'mln', 'dlrs', '.', '\"', 'i', 'believe', 'the',\n",
      "  'entire', 'chemical', 'industry', 'is', 'headed', 'for', 'a', 'record', 'year', 'or', 'close',\n",
      "  'to', 'it', ',\"', 'oreffice', 'said', '.', 'gaf', 'chairman', 'samuel', 'heyman', 'estimated',\n",
      "  'that', 'the', 'u', '.', 's', '.', 'chemical', 'industry', 'would', 'report', 'a', '20', 'pct',\n",
      "  'gain', 'in', 'profits', 'during', '1987', '.', 'last', 'year', ',', 'the', 'domestic',\n",
      "  'industry', 'earned', 'a', 'total', 'of', '13', 'billion', 'dlrs', ',', 'a', '54', 'pct', 'leap',\n",
      "  'from', '1985', '.', 'the', 'turn', 'in', 'the', 'fortunes', 'of', 'the', 'once', '-', 'sickly',\n",
      "  'chemical', 'industry', 'has', 'been', 'brought', 'about', 'by', 'a', 'combination', 'of', 'luck',\n",
      "  'and', 'planning', ',', 'said', 'pace', \"'\", 's', 'john', 'dosher', '.', 'dosher', 'said', 'last',\n",
      "  'year', \"'\", 's', 'fall', 'in', 'oil', 'prices', 'made', 'feedstocks', 'dramatically', 'cheaper',\n",
      "  'and', 'at', 'the', 'same', 'time', 'the', 'american', 'dollar', 'was', 'weakening', 'against',\n",
      "  'foreign', 'currencies', '.', 'that', 'helped', 'boost', 'u', '.', 's', '.', 'chemical',\n",
      "  'exports', '.', 'also', 'helping', 'to', 'bring', 'supply', 'and', 'demand', 'into', 'balance',\n",
      "  'has', 'been', 'the', 'gradual', 'market', 'absorption', 'of', 'the', 'extra', 'chemical',\n",
      "  'manufacturing', 'capacity', 'created', 'by', 'middle', 'eastern', 'oil', 'producers', 'in',\n",
      "  'the', 'early', '1980s', '.', 'finally', ',', 'virtually', 'all', 'major', 'u', '.', 's', '.',\n",
      "  'chemical', 'manufacturers', 'have', 'embarked', 'on', 'an', 'extensive', 'corporate',\n",
      "  'restructuring', 'program', 'to', 'mothball', 'inefficient', 'plants', ',', 'trim', 'the',\n",
      "  'payroll', 'and', 'eliminate', 'unrelated', 'businesses', '.', 'the', 'restructuring', 'touched',\n",
      "  'off', 'a', 'flurry', 'of', 'friendly', 'and', 'hostile', 'takeover', 'attempts', '.', 'gaf', ',',\n",
      "  'which', 'made', 'an', 'unsuccessful', 'attempt', 'in', '1985', 'to', 'acquire', 'union',\n",
      "  'carbide', 'corp', '&', 'lt', ';', 'uk', '>,', 'recently', 'offered', 'three', 'billion', 'dlrs',\n",
      "  'for', 'borg', 'warner', 'corp', '&', 'lt', ';', 'bor', '>,', 'a', 'chicago', 'manufacturer',\n",
      "  'of', 'plastics', 'and', 'chemicals', '.', 'another', 'industry', 'powerhouse', ',', 'w', '.',\n",
      "  'r', '.', 'grace', '&', 'lt', ';', 'gra', '>', 'has', 'divested', 'its', 'retailing', ',',\n",
      "  'restaurant', 'and', 'fertilizer', 'businesses', 'to', 'raise', 'cash', 'for', 'chemical',\n",
      "  'acquisitions', '.', 'but', 'some', 'experts', 'worry', 'that', 'the', 'chemical', 'industry',\n",
      "  'may', 'be', 'headed', 'for', 'trouble', 'if', 'companies', 'continue', 'turning', 'their',\n",
      "  'back', 'on', 'the', 'manufacturing', 'of', 'staple', 'petrochemical', 'commodities', ',', 'such',\n",
      "  'as', 'ethylene', ',', 'in', 'favor', 'of', 'more', 'profitable', 'specialty', 'chemicals',\n",
      "  'that', 'are', 'custom', '-', 'designed', 'for', 'a', 'small', 'group', 'of', 'buyers', '.', '\"',\n",
      "  'companies', 'like', 'dupont', '&', 'lt', ';', 'dd', '>', 'and', 'monsanto', 'co', '&', 'lt', ';',\n",
      "  'mtc', '>', 'spent', 'the', 'past', 'two', 'or', 'three', 'years', 'trying', 'to', 'get', 'out',\n",
      "  'of', 'the', 'commodity', 'chemical', 'business', 'in', 'reaction', 'to', 'how', 'badly', 'the',\n",
      "  'market', 'had', 'deteriorated', ',\"', 'dosher', 'said', '.', '\"', 'but', 'i', 'think', 'they',\n",
      "  'will', 'eventually', 'kill', 'the', 'margins', 'on', 'the', 'profitable', 'chemicals', 'in',\n",
      "  'the', 'niche', 'market', '.\"', 'some', 'top', 'chemical', 'executives', 'share', 'the',\n",
      "  'concern', '.', '\"', 'the', 'challenge', 'for', 'our', 'industry', 'is', 'to', 'keep', 'from',\n",
      "  'getting', 'carried', 'away', 'and', 'repeating', 'past', 'mistakes', ',\"', 'gaf', \"'\", 's',\n",
      "  'heyman', 'cautioned', '.', '\"', 'the', 'shift', 'from', 'commodity', 'chemicals', 'may', 'be',\n",
      "  'ill', '-', 'advised', '.', 'specialty', 'businesses', 'do', 'not', 'stay', 'special', 'long',\n",
      "  '.\"', 'houston', '-', 'based', 'cain', 'chemical', ',', 'created', 'this', 'month', 'by', 'the',\n",
      "  'sterling', 'investment', 'banking', 'group', ',', 'believes', 'it', 'can', 'generate', '700',\n",
      "  'mln', 'dlrs', 'in', 'annual', 'sales', 'by', 'bucking', 'the', 'industry', 'trend', '.',\n",
      "  'chairman', 'gordon', 'cain', ',', 'who', 'previously', 'led', 'a', 'leveraged', 'buyout', 'of',\n",
      "  'dupont', \"'\", 's', 'conoco', 'inc', \"'\", 's', 'chemical', 'business', ',', 'has', 'spent', '1',\n",
      "  '.', '1', 'billion', 'dlrs', 'since', 'january', 'to', 'buy', 'seven', 'petrochemical', 'plants',\n",
      "  'along', 'the', 'texas', 'gulf', 'coast', '.', 'the', 'plants', 'produce', 'only', 'basic',\n",
      "  'commodity', 'petrochemicals', 'that', 'are', 'the', 'building', 'blocks', 'of', 'specialty',\n",
      "  'products', '.', '\"', 'this', 'kind', 'of', 'commodity', 'chemical', 'business', 'will', 'never',\n",
      "  'be', 'a', 'glamorous', ',', 'high', '-', 'margin', 'business', ',\"', 'cain', 'said', ',',\n",
      "  'adding', 'that', 'demand', 'is', 'expected', 'to', 'grow', 'by', 'about', 'three', 'pct',\n",
      "  'annually', '.', 'garo', 'armen', ',', 'an', 'analyst', 'with', 'dean', 'witter', 'reynolds', ',',\n",
      "  'said', 'chemical', 'makers', 'have', 'also', 'benefitted', 'by', 'increasing', 'demand', 'for',\n",
      "  'plastics', 'as', 'prices', 'become', 'more', 'competitive', 'with', 'aluminum', ',', 'wood',\n",
      "  'and', 'steel', 'products', '.', 'armen', 'estimated', 'the', 'upturn', 'in', 'the', 'chemical',\n",
      "  'business', 'could', 'last', 'as', 'long', 'as', 'four', 'or', 'five', 'years', ',', 'provided',\n",
      "  'the', 'u', '.', 's', '.', 'economy', 'continues', 'its', 'modest', 'rate', 'of', 'growth', '.',\n",
      "  '<END>'],\n",
      " ['<START>', 'turkey', 'calls', 'for', 'dialogue', 'to', 'solve', 'dispute', 'turkey', 'said',\n",
      "  'today', 'its', 'disputes', 'with', 'greece', ',', 'including', 'rights', 'on', 'the',\n",
      "  'continental', 'shelf', 'in', 'the', 'aegean', 'sea', ',', 'should', 'be', 'solved', 'through',\n",
      "  'negotiations', '.', 'a', 'foreign', 'ministry', 'statement', 'said', 'the', 'latest', 'crisis',\n",
      "  'between', 'the', 'two', 'nato', 'members', 'stemmed', 'from', 'the', 'continental', 'shelf',\n",
      "  'dispute', 'and', 'an', 'agreement', 'on', 'this', 'issue', 'would', 'effect', 'the', 'security',\n",
      "  ',', 'economy', 'and', 'other', 'rights', 'of', 'both', 'countries', '.', '\"', 'as', 'the',\n",
      "  'issue', 'is', 'basicly', 'political', ',', 'a', 'solution', 'can', 'only', 'be', 'found', 'by',\n",
      "  'bilateral', 'negotiations', ',\"', 'the', 'statement', 'said', '.', 'greece', 'has', 'repeatedly',\n",
      "  'said', 'the', 'issue', 'was', 'legal', 'and', 'could', 'be', 'solved', 'at', 'the',\n",
      "  'international', 'court', 'of', 'justice', '.', 'the', 'two', 'countries', 'approached', 'armed',\n",
      "  'confrontation', 'last', 'month', 'after', 'greece', 'announced', 'it', 'planned', 'oil',\n",
      "  'exploration', 'work', 'in', 'the', 'aegean', 'and', 'turkey', 'said', 'it', 'would', 'also',\n",
      "  'search', 'for', 'oil', '.', 'a', 'face', '-', 'off', 'was', 'averted', 'when', 'turkey',\n",
      "  'confined', 'its', 'research', 'to', 'territorrial', 'waters', '.', '\"', 'the', 'latest',\n",
      "  'crises', 'created', 'an', 'historic', 'opportunity', 'to', 'solve', 'the', 'disputes', 'between',\n",
      "  'the', 'two', 'countries', ',\"', 'the', 'foreign', 'ministry', 'statement', 'said', '.', 'turkey',\n",
      "  \"'\", 's', 'ambassador', 'in', 'athens', ',', 'nazmi', 'akiman', ',', 'was', 'due', 'to', 'meet',\n",
      "  'prime', 'minister', 'andreas', 'papandreou', 'today', 'for', 'the', 'greek', 'reply', 'to', 'a',\n",
      "  'message', 'sent', 'last', 'week', 'by', 'turkish', 'prime', 'minister', 'turgut', 'ozal', '.',\n",
      "  'the', 'contents', 'of', 'the', 'message', 'were', 'not', 'disclosed', '.', '<END>']]\n"
     ]
    }
   ],
   "source": [
    "reuters_corpus = read_corpus()\n",
    "pprint.pprint(reuters_corpus[:3], compact=True, width=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.1: Implement `distinct_words` [code] (2 points)\n",
    "\n",
    "Write a method to work out the distinct words (word types) that occur in the corpus. You can do this with `for` loops, but it's more efficient to do it with Python list comprehensions. In particular, [this](https://coderwall.com/p/rcmaea/flatten-a-list-of-lists-in-one-line-in-python) may be useful to flatten a list of lists. If you're not familiar with Python list comprehensions in general, here's [more information](https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html).\n",
    "\n",
    "Your returned `corpus_words` should be sorted. You can use python's `sorted` function for this.\n",
    "\n",
    "You may find it useful to use [Python sets](https://www.w3schools.com/python/python_sets.asp) to remove duplicate words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def distinct_words(corpus):\n",
    "    \"\"\" Determine a list of distinct words for the corpus.\n",
    "        Params:\n",
    "            corpus (list of list of strings): corpus of documents\n",
    "        Return:\n",
    "            corpus_words (list of strings): sorted list of distinct words across the corpus\n",
    "            num_corpus_words (integer): number of distinct words across the corpus\n",
    "    \"\"\"\n",
    "    corpus_words = []\n",
    "    num_corpus_words = -1\n",
    "    \n",
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "    corpus_words = {word for doc in corpus for word in doc}\n",
    "    corpus_words = sorted(list(corpus_words))\n",
    "    num_corpus_words = len(corpus_words)\n",
    "    # ------------------\n",
    "\n",
    "    return corpus_words, num_corpus_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------------------------------------------------------------------------------\n",
      "Passed All Tests!\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "# ---------------------\n",
    "# Run this sanity check\n",
    "# Note that this not an exhaustive check for correctness.\n",
    "# ---------------------\n",
    "\n",
    "# Define toy corpus\n",
    "test_corpus = [\"{} All that glitters isn't gold {}\".format(START_TOKEN, END_TOKEN).split(\" \"), \"{} All's well that ends well {}\".format(START_TOKEN, END_TOKEN).split(\" \")]\n",
    "test_corpus_words, num_corpus_words = distinct_words(test_corpus)\n",
    "\n",
    "# Correct answers\n",
    "ans_test_corpus_words = sorted([START_TOKEN, \"All\", \"ends\", \"that\", \"gold\", \"All's\", \"glitters\", \"isn't\", \"well\", END_TOKEN])\n",
    "ans_num_corpus_words = len(ans_test_corpus_words)\n",
    "\n",
    "# Test correct number of words\n",
    "assert(num_corpus_words == ans_num_corpus_words), \"Incorrect number of distinct words. Correct: {}. Yours: {}\".format(ans_num_corpus_words, num_corpus_words)\n",
    "\n",
    "# Test correct words\n",
    "assert (test_corpus_words == ans_test_corpus_words), \"Incorrect corpus_words.\\nCorrect: {}\\nYours:   {}\".format(str(ans_test_corpus_words), str(test_corpus_words))\n",
    "\n",
    "# Print Success\n",
    "print (\"-\" * 80)\n",
    "print(\"Passed All Tests!\")\n",
    "print (\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.2: Implement `compute_co_occurrence_matrix` [code] (3 points)\n",
    "\n",
    "Write a method that constructs a co-occurrence matrix for a certain window-size $n$ (with a default of 4), considering words $n$ before and $n$ after the word in the center of the window. Here, we start to use `numpy (np)` to represent vectors, matrices, and tensors. If you're not familiar with NumPy, there's a NumPy tutorial in the second half of this cs231n [Python NumPy tutorial](http://cs231n.github.io/python-numpy-tutorial/).\n"
   ]
  },
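  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration of the windowing rule, here is a minimal sketch in plain Python, separate from the NumPy implementation the question asks for. The helper `window_pairs` is a hypothetical name used only for this example; it counts ordered (center, context) pairs and reproduces the `All` example from the function's docstring with a window size of 4.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def window_pairs(doc, window_size):\n",
    "    # Hypothetical helper, not part of the assignment: counts ordered\n",
    "    # (center, context) pairs that lie within window_size positions.\n",
    "    counts = Counter()\n",
    "    for i, center in enumerate(doc):\n",
    "        lo = max(0, i - window_size)                # clip window at document start\n",
    "        hi = min(len(doc), i + window_size + 1)     # clip window at document end\n",
    "        for j in range(lo, hi):\n",
    "            if j != i:                              # a word is not its own context\n",
    "                counts[(center, doc[j])] += 1\n",
    "    return counts\n",
    "\n",
    "doc = '<START> All that glitters is not gold <END>'.split()\n",
    "counts = window_pairs(doc, window_size=4)\n",
    "neighbors_of_all = sorted(ctx for (center, ctx) in counts if center == 'All')\n",
    "print(neighbors_of_all)\n",
    "```\n",
    "\n",
    "Each pair is counted once from either side, so the counts are symmetric; this is why the resulting matrix `M` is expected to be symmetric."
   ]
  },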
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_co_occurrence_matrix(corpus, window_size=4):\n",
    "    \"\"\" Compute co-occurrence matrix for the given corpus and window_size (default of 4).\n",
    "    \n",
    "        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller\n",
    "              number of co-occurring words.\n",
    "              \n",
    "              For example, if we take the document \"<START> All that glitters is not gold <END>\" with window size of 4,\n",
    "              \"All\" will co-occur with \"<START>\", \"that\", \"glitters\", \"is\", and \"not\".\n",
    "    \n",
    "        Params:\n",
    "            corpus (list of list of strings): corpus of documents\n",
    "            window_size (int): size of context window\n",
    "        Return:\n",
    "            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): \n",
    "                Co-occurence matrix of word counts. \n",
    "                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.\n",
    "            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.\n",
    "    \"\"\"\n",
    "    words, num_words = distinct_words(corpus)\n",
    "    M = None\n",
    "    word2ind = {}\n",
    "    \n",
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "    \n",
    "    # Build the word to index mapping.\n",
    "    word2ind = {word: i for i, word in enumerate(words)}\n",
    "    \n",
    "    # Build the co-occurrence matrix.\n",
    "    M = np.zeros((num_words, num_words))\n",
    "    for body in corpus:\n",
    "        for curr_idx, word in enumerate(body):\n",
    "            for window_idx in range(-window_size, window_size + 1):\n",
    "                neighbor_idx = curr_idx + window_idx\n",
    "                if (neighbor_idx < 0) or (neighbor_idx >= len(body)) or (curr_idx == neighbor_idx):\n",
    "                    continue\n",
    "                co_occur_word = body[neighbor_idx]\n",
    "                (word_idx, co_occur_idx) = (word2ind[word], word2ind[co_occur_word])\n",
    "                M[word_idx, co_occur_idx] += 1\n",
    "\n",
    "    # ------------------\n",
    "\n",
    "    return M, word2ind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------------------------------------------------------------------------------\n",
      "Passed All Tests!\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "# ---------------------\n",
    "# Run this sanity check\n",
    "# Note that this is not an exhaustive check for correctness.\n",
    "# ---------------------\n",
    "\n",
    "# Define toy corpus and get student's co-occurrence matrix\n",
    "test_corpus = [\"{} All that glitters isn't gold {}\".format(START_TOKEN, END_TOKEN).split(\" \"), \"{} All's well that ends well {}\".format(START_TOKEN, END_TOKEN).split(\" \")]\n",
    "M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)\n",
    "\n",
    "# Correct M and word2ind\n",
    "M_test_ans = np.array( \n",
    "    [[0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,],\n",
    "     [0., 0., 1., 1., 0., 0., 0., 0., 0., 0.,],\n",
    "     [0., 1., 0., 0., 0., 0., 0., 0., 1., 0.,],\n",
    "     [0., 1., 0., 0., 0., 0., 0., 0., 0., 1.,],\n",
    "     [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,],\n",
    "     [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,],\n",
    "     [1., 0., 0., 0., 0., 0., 0., 1., 0., 0.,],\n",
    "     [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,],\n",
    "     [0., 0., 1., 0., 1., 1., 0., 0., 0., 1.,],\n",
    "     [1., 0., 0., 1., 1., 0., 0., 0., 1., 0.,]]\n",
    ")\n",
    "ans_test_corpus_words = sorted([START_TOKEN, \"All\", \"ends\", \"that\", \"gold\", \"All's\", \"glitters\", \"isn't\", \"well\", END_TOKEN])\n",
    "word2ind_ans = dict(zip(ans_test_corpus_words, range(len(ans_test_corpus_words))))\n",
    "\n",
    "# Test correct word2ind\n",
    "assert (word2ind_ans == word2ind_test), \"Your word2ind is incorrect:\\nCorrect: {}\\nYours: {}\".format(word2ind_ans, word2ind_test)\n",
    "\n",
    "# Test correct M shape\n",
    "assert (M_test.shape == M_test_ans.shape), \"M matrix has incorrect shape.\\nCorrect: {}\\nYours: {}\".format(M_test.shape, M_test_ans.shape)\n",
    "\n",
    "# Test correct M values\n",
    "for w1 in word2ind_ans.keys():\n",
    "    idx1 = word2ind_ans[w1]\n",
    "    for w2 in word2ind_ans.keys():\n",
    "        idx2 = word2ind_ans[w2]\n",
    "        student = M_test[idx1, idx2]\n",
    "        correct = M_test_ans[idx1, idx2]\n",
    "        if student != correct:\n",
    "            print(\"Correct M:\")\n",
    "            print(M_test_ans)\n",
    "            print(\"Your M: \")\n",
    "            print(M_test)\n",
    "            raise AssertionError(\"Incorrect count at index ({}, {})=({}, {}) in matrix M. Yours has {} but should have {}.\".format(idx1, idx2, w1, w2, student, correct))\n",
    "\n",
    "# Print Success\n",
    "print (\"-\" * 80)\n",
    "print(\"Passed All Tests!\")\n",
    "print (\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.3: Implement `reduce_to_k_dim` [code] (1 point)\n",
    "\n",
    "Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings. \n",
    "\n",
    "**Note:** All of numpy, scipy, and scikit-learn (`sklearn`) provide *some* implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use [sklearn.decomposition.TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)."
   ]
  },
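  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition about what the reduced matrix contains, the following is a minimal sketch that uses NumPy's full SVD (`np.linalg.svd`) instead of `TruncatedSVD`; the toy matrix values are made up for illustration. It keeps the top k left singular vectors and scales each by its singular value, i.e. U * S restricted to k columns. The output of `TruncatedSVD` would be expected to agree with this up to the sign of each column and numerical error, though that is an assumption and is not checked here.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Toy symmetric count matrix (illustrative values, not taken from the corpus).\n",
    "M = np.array([[0., 2., 1.],\n",
    "              [2., 0., 3.],\n",
    "              [1., 3., 0.]])\n",
    "k = 2\n",
    "\n",
    "# Full SVD: M = U @ diag(S) @ Vt, with singular values in descending order.\n",
    "U, S, Vt = np.linalg.svd(M, full_matrices=False)\n",
    "\n",
    "# Keep the top-k components and scale each column of U by its singular value.\n",
    "M_reduced = U[:, :k] * S[:k]\n",
    "print(M_reduced.shape)\n",
    "```\n",
    "\n",
    "Equivalently, `M_reduced` is the projection of the rows of `M` onto the top k right singular vectors, `M @ Vt[:k].T`."
   ]
  },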
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def reduce_to_k_dim(M, k=2):\n",
    "    \"\"\" Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)\n",
    "        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:\n",
    "            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html\n",
    "    \n",
    "        Params:\n",
    "            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurence matrix of word counts\n",
    "            k (int): embedding size of each word after dimension reduction\n",
    "        Return:\n",
    "            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.\n",
    "                    In terms of the SVD from math class, this actually returns U * S\n",
    "    \"\"\"    \n",
    "    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`\n",
    "    M_reduced = None\n",
    "    print(\"Running Truncated SVD over %i words...\" % (M.shape[0]))\n",
    "    \n",
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "    svd = TruncatedSVD(n_components = k, n_iter = n_iters)\n",
    "    M_reduced = svd.fit_transform(M) \n",
    "    # or (instead of the above line)...\n",
    "    # svd.fit(M)\n",
    "    # M_reduced = svd.transform(M)\n",
    "    \n",
    "    # ------------------\n",
    "\n",
    "    print(\"Done.\")\n",
    "    return M_reduced"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running Truncated SVD over 10 words...\n",
      "Done.\n",
      "--------------------------------------------------------------------------------\n",
      "Passed All Tests!\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "# ---------------------\n",
    "# Run this sanity check\n",
    "# Note that this is not an exhaustive check for correctness \n",
    "# In fact we only check that your M_reduced has the right dimensions.\n",
    "# ---------------------\n",
    "\n",
    "# Define toy corpus and run student code\n",
    "test_corpus = [\"{} All that glitters isn't gold {}\".format(START_TOKEN, END_TOKEN).split(\" \"), \"{} All's well that ends well {}\".format(START_TOKEN, END_TOKEN).split(\" \")]\n",
    "M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)\n",
    "M_test_reduced = reduce_to_k_dim(M_test, k=2)\n",
    "\n",
    "# Test proper dimensions\n",
    "assert (M_test_reduced.shape[0] == 10), \"M_reduced has {} rows; should have {}\".format(M_test_reduced.shape[0], 10)\n",
    "assert (M_test_reduced.shape[1] == 2), \"M_reduced has {} columns; should have {}\".format(M_test_reduced.shape[1], 2)\n",
    "\n",
    "# Print Success\n",
    "print (\"-\" * 80)\n",
    "print(\"Passed All Tests!\")\n",
    "print (\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.4: Implement `plot_embeddings` [code] (1 point)\n",
    "\n",
    "Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (`plt`).\n",
    "\n",
    "For this example, you may find it useful to adapt [this code](http://web.archive.org/web/20190924160434/https://www.pythonmembers.club/2018/05/08/matplotlib-scatter-plot-annotate-set-text-at-label-each-point/). In the future, a good way to make a plot is to look at [the Matplotlib gallery](https://matplotlib.org/gallery/index.html), find a plot that looks somewhat like what you want, and adapt the code they give."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_embeddings(M_reduced, word2ind, words):\n",
    "    \"\"\" Plot in a scatterplot the embeddings of the words specified in the list \"words\".\n",
    "        NOTE: do not plot all the words listed in M_reduced / word2ind.\n",
    "        Include a label next to each point.\n",
    "        \n",
    "        Params:\n",
    "            M_reduced (numpy matrix of shape (number of unique words in the corpus , 2)): matrix of 2-dimensioal word embeddings\n",
    "            word2ind (dict): dictionary that maps word to indices for matrix M\n",
    "            words (list of strings): words whose embeddings we want to visualize\n",
    "    \"\"\"\n",
    "\n",
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "    \n",
    "    # Get only the rows corresponding the words we want to plot.\n",
    "    word_idxs = [word2ind[word] for word in words]\n",
    "    word_vectors = M_reduced[word_idxs]\n",
    "    # Get 2D coordinates.\n",
    "    x_coords = [vec[0] for vec in word_vectors]\n",
    "    y_coords = [vec[1] for vec in word_vectors]\n",
    "    # Plot the scatter points in 2D.\n",
    "    for i, word in enumerate(words):\n",
    "        x = x_coords[i]\n",
    "        y = y_coords[i]\n",
    "        plt.scatter(x, y, marker='x', color='red')\n",
    "        plt.text(x, y, word, fontsize=9)\n",
    "    plt.show()\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------------------------------------------------------------------------------\n",
      "Outputted Plot:\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmIAAAEvCAYAAADmeK3JAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAftUlEQVR4nO3df7BdZX3v8ffXBEZCHQNJiBFIoSbtAP1BPGeAigNHAc2PK1HUAVoERSfjrUhIrLdBhgYqzgUdkh7vtdLUMiRMK+MdRM4IqUDwiErxkmOBgEAICJeYCEg9IkLBkO/9Y6/Ezcn5sU/2Tp6zc96vmT17rfU8z1rPk5XF+rCevXciM5EkSdLe94bSHZAkSRqvDGKSJEmFGMQkSZIKMYhJkiQVYhCTJEkqxCAmSZJUyMTSHdgdU6dOzSOOOKJ0NyRJkkbU19f3i8ycNlhZWwaxI444gvXr15fuhiRJ0ogi4qmhypyalCRJKsQgJkmSVIhBTJIkqRCDmCRJUiEGsWH09/ezZs2aUbV58skn6enp2WX7eeedx6mnntqqrkmSpBZoxb3+pptu4qijjuKNb3zjqI9vEBtGq4LYhg0b6O/vb2HPJElSK7TiXn/SSSfxH//xHxx22GGjPr5BbKDMnYsrVqygr6+Prq4uVq9ezYIFC3j3u9/NggULeO6553jppZeYN28eJ598Ml1dXWzcuJEVK1Zwyy230NXVRV9fHwB/93d/x+c+97lSI5IkSQNV9/tW3OunTJmyW0/DoEVBLCKujYhnI+LBIcojIr4cEZsi4oGIeHtd2dyIeLQqW9aK/uy2yy6DJUt2npylS5bQ8aY30dvVxdq1a7n00ku58847WbRoEVdddRWPPPIIBx10EN/73vfo7e1l1qxZLF26lAULFtDb20tHRwe9vb384R/+IdOnTy86NEmSVKm73y9durR2vz72WNZ+6Uu7da9vRqt+0PU64H8DQz3bmwfMrl7HA18Fjo+ICcBXgNOAzcC9EdGTmT9pUb8alwn9/dDdXVtfuRI+/3l4+mno72fDhg0sW1bLidu2bWPWrFnMmTOHjo4OzjnnHKZMmcLll1++y26vvPJKbrjhBqcmJUkaCwbe7xcvho0bYd06Nhx88G7d65vRkiCWmXdFxBHDVFkIrMnMBO6JiMkRMQM4AtiUmU8ARMQNVd29H8QiauELaienu5v9gW2HHgorV3LMmWdy8cUXM2fOHABeffVVXnnlFZYuXUpEcMUVV3D99dfT0dHBtm3bAPj1r3/Nz3/+c8466yxefvllHnroIb7whS9wySWX7PXhSZIkdrnf79/dzTaAxYs5ZsuWUd/rm+5O1n0mqqkd1YLYtzPzjwcp+zZwZWb+oFpfB/wNtSA2NzM/UW3/CHB8Zl4wyD4WAYsAZs6c2fHUU0P+awHNyYQ31GZstwML5s5l0qRJnH766dx44428+OKLAJx//vkcffTRXHjhhUycOJHt27ezevVqpk6dyvz585k+fTrLly/nT/7kT4DaB/s+8YlPcMcdd+yZfkuSpMZV9/vtwAJg0hln7Pa9vr+/n8svv5y7776bd7zjHfzVX/0VZ5xxxs5DRURfZnYO1o29FcRuAf7ngCD2P4A/AN47IIgdl5mfHu5YnZ2duUf+rcnM2pzxjseVUHtkuXJlLUFLkqT2t5fv98MFsb31rcnNwOF164cBW4bZvvfVn5TFi2H79tp7d/frPsAvSZLa2Bi737fqw/oj6QEuqD4Ddjzwq8zcGhHPAbMj4kjgZ8BZwF/spT69XgRMnvz6RLxjDnnyZJ+ISZK0Lxhj9/uWTE1GxNeBLmAq8AywHNgPIDOviYig9q3KucBLwMcyc33Vdj7w98AE4NrM/MJIx9tjU5O1Dr/+JAxclyRJ7W8v3u+Hm5ps1bcmzx6hPIFPDVF2K3BrK/rREgNPgiFMkqR9zxi53/vL+pIkSYUYxCRJkgoxiEmSJBViEJMkSSrE
ICZJklSIQUySJKkQg5gkSVIhBjFJkqRCDGKSJEmFGMQkSZIKMYhJkiQVYhCTJEkqxCAmSZJUiEFMkiSpEIOYJElSIQYxSZKkQgxikiRJhRjEJEmSCjGISZIkFWIQkyRJKsQgJkmSVEhLglhEzI2IRyNiU0QsG6T8sxFxX/V6MCJei4iDq7InI2JDVba+Ff2RJElqBxOb3UFETAC+ApwGbAbujYiezPzJjjqZ+SXgS1X99wFLMvM/63bzrsz8RbN9kSRJaieteCJ2HLApM5/IzFeBG4CFw9Q/G/h6C44rSZLU1loRxA4Fnq5b31xt20VETALmAjfWbU7gtojoi4hFLeiPJElSW2h6ahKIQbblEHXfB/xwwLTkiZm5JSIOAW6PiEcy865dDlILaYsAZs6c2WyfJUmSimvFE7HNwOF164cBW4aoexYDpiUzc0v1/ixwE7Wpzl1k5qrM7MzMzmnTpjXdaUmSpNJaEcTuBWZHxJERsT+1sNUzsFJEvBk4Gbi5btuBEfGmHcvAe4AHW9AnSZKkMa/pqcnM3BYRFwDfASYA12bmQxHxyar8mqrqB4DbMvM3dc2nAzdFxI6+/Gtm/luzfZIkSWoHkTnUx7nGrs7Ozly/3p8ckyRJY19E9GVm52Bl/rK+JElSIQYxSZKkQgxikiRJhRjEJEmSCjGISZIkFWIQkyRJKsQgJkmSVIhBTJIkqRCDmCRJUiEGMUmSpEIMYpIkSYUYxCRJkgoxiEmSJBViEJMkSSrEICZJklSIQUySJKkQg5gkSVIhBjFJkqRCDGKSJEmFGMQkSZIKMYhJkiQVYhCTJEkqpCVBLCLmRsSjEbEpIpYNUt4VEb+KiPuq19822laSJGlfNbHZHUTEBOArwGnAZuDeiOjJzJ8MqPr9zPxvu9lWkiRpn9OKJ2LHAZsy84nMfBW4AVi4F9pKkiS1tVYEsUOBp+vWN1fbBvrziLg/ItZGxDGjbCtJkrTPaXpqEohBtuWA9R8Dv5+ZL0bEfOBbwOwG29YOErEIWAQwc+bM3e6sJEnSWNGKJ2KbgcPr1g8DttRXyMwXMvPFavlWYL+ImNpI27p9rMrMzszsnDZtWgu6LUmSVFYrgti9wOyIODIi9gfOAnrqK0TEWyIiquXjquM+30hbSZKkfVXTU5OZuS0iLgC+A0wArs3MhyLik1X5NcCHgP8eEduAl4GzMjOBQds22ydJkqR2ELU81F46Oztz/fr1pbshSZI0oojoy8zOwcr8ZX1JkqRCDGKSJEmFGMQkSZIKMYhJkiQVYhCTJEkqxCAmSZJUiEFMkiSpEIOYJElSIQYxSZKkQgxikiRJhRjEJEmSCjGISZIkFWIQkyRJKsQgJkmSVIhBTJIkqRCDmCRJUiEGMUmSpEIMYpIkSYUYxCRJkgoxiEmSJBViEJMkSSrEICZJklRIS4JYRMyNiEcjYlNELBuk/C8j4oHqdXdE/Fld2ZMRsSEi7ouI9a3ojyRJUjuY2OwOImIC8BXgNGAzcG9E9GTmT+qq/RQ4OTN/GRHzgFXA8XXl78rMXzTbF0mSpHbSiidixwGbMvOJzHwVuAFYWF8hM+/OzF9Wq/cAh7XguJIkSW2tFUHsUODpuvXN1bahfBxYW7eewG0R0RcRi1rQH0mSpLbQ9NQkEINsy0ErRryLWhB7Z93mEzNzS0QcAtweEY9k5l2DtF0ELAKYOXNm872WJEkqrBVPxDYDh9etHwZsGVgpIv4U+BqwMDOf37E9M7dU788CN1Gb6txFZq7KzM7M7Jw2bVoLui1JklRWK4LYvcDsiDgyIvYHzgJ66itExEzgm8BHMnNj3fYDI+JNO5aB9wAPtqBPkiRJY17TU5OZuS0iLgC+A0wArs3MhyLik1X5NcDfAlOAf4gIgG2Z2QlMB26qtk0E/jUz/63ZPkmSJLWDyBz041xjWmdnZ65f70+OSZKksS8i+qoHULvwl/UlSZIKMYhJkiQVYhCTJEkqxCAmSZJUiEFMkiSp
EIOYJElSIQYxSZKkQgxikiRJhRjEJEmSCjGISZIkFWIQkyRJKsQgJkmSVIhBTJIkqRCDmCRJUiEGMUmSpEIMYpIkSYUYxCRJkgoxiEmSJBViEJMkSSrEICZJlf7+ftasWTOqNk8++SQ9PT071y+77DKOOuoourq66Orq4rXXXmt1NyXtQwxiklRpRRADuOSSS+jt7aW3t5cJEya0souS9jEGMUmqrFixgr6+Prq6uli9ejULFizg3e9+NwsWLOC5557jpZdeYt68eZx88sl0dXWxceNGVqxYwS233EJXVxd9fX0AfPGLX+Sd73wnX/7ylwuPSNJY15IgFhFzI+LRiNgUEcsGKY+I+HJV/kBEvL3RtpK0x2UCsHTpUjo6Ouj97ndZu3Ytl156KXfeeSeLFi3iqquu4pFHHuGggw7ie9/7Hr29vcyaNYulS5eyYMECent76ejo4NOf/jT3338/t99+Oz09Pdx1112FBydpLJvY7A4iYgLwFeA0YDNwb0T0ZOZP6qrNA2ZXr+OBrwLHN9hWkvacyy6D/n5YufJ325YsYcOdd7Ls5z8HYNu2bcyaNYs5c+bQ0dHBOeecw5QpU7j88st32d2UKVMAOOCAAzjjjDPo6+vjpJNO2gsDkdSOmg5iwHHApsx8AiAibgAWAvVhaiGwJjMTuCciJkfEDOCIBtpK0p6RWQth3d0A7P/Zz7LtkUdg3TqOmT2bi1esYM7baw/wX331VV555RWWLl1KRHDFFVdw/fXX09HRwbZt23busr+/n8mTJ5OZ9Pb28tGPfrTAwCS1i1YEsUOBp+vWN1N76jVSnUMbbCtJe0bE756EdXfzlu5uDgA++La3cfrnPsfyyy7jxRdfBOD888/n6KOP5sILL2TixIls376d1atXM3XqVB5//HE+9KEPsXz5cq6++moeffRRMpOuri7mz59fbnySxrxWBLEYZFs2WKeRtrUdRCwCFgHMnDlzNP2TpKHtCGPd3bwBWAvw2GMQwXmDPM36wQ9+sMu273//+zuXr7vuuj3VU0n7oFZ8WH8zcHjd+mHAlgbrNNIWgMxclZmdmdk5bdq0pjstSUBtenLJktdvW7Jk5wf4JWlPakUQuxeYHRFHRsT+wFlAz4A6PcC51bcnTwB+lZlbG2wrSXvGjhDW3Q2LF8P27bX37m7DmKS9oumpyczcFhEXAN8BJgDXZuZDEfHJqvwa4FZgPrAJeAn42HBtm+2TJDUkAiZProWvlStf/5mxyZNr65K0B0W24f/xdXZ25vr160t3Q9K+IvP1oWvguiQ1ISL6MrNzsDJ/WV+SBoYuQ5ikvcQgJkmSVIhBTJIkqRCDmCRJUiEGMUmSpEIMYpIkSYUYxCRJkgoxiEmSJBViEJMkSSrEICZJklSIQUySJKkQg5gkSVIhBjFJkqRCDGKSJEmFGMQkSZIKMYhJkiQVYhCTJEkqxCAmSZJUiEFMkiSpEIOYJElSIQYxSZKkQgxikiRJhTQVxCLi4Ii4PSIeq94PGqTO4RHx3Yh4OCIeiojFdWWXRcTPIuK+6jW/mf5IkiS1k2afiC0D1mXmbGBdtT7QNuAzmXkUcALwqYg4uq58ZWYeW71ubbI/kiRJbaPZILYQWF0trwbeP7BCZm7NzB9Xy78GHgYObfK4kiRJba/ZIDY9M7dCLXABhwxXOSKOAOYAP6rbfEFEPBAR1w42tSlJkrSvGjGIRcQdEfHgIK+FozlQRPwecCNwUWa+UG3+KvA24FhgK3D1MO0XRcT6iFj/3HPPjebQkiRJY9LEkSpk5qlDlUXEMxExIzO3RsQM4Nkh6u1HLYT9S2Z+s27fz9TV+Sfg28P0YxWwCqCzszNH6rckSdJY1+zUZA9wXrV8HnDzwAoREcA/Aw9n5ooBZTPqVj8APNhkfyRJktpGs0HsSuC0iHgMOK1aJyLeGhE7vgF5IvAR4N2D/EzFFyNiQ0Q8ALwLWNJkfyRJktrGiFOTw8nM54FTBtm+BZhfLf8A
iCHaf6SZ40uSJLUzf1lfkiSpEIOYJElSIQYxSZKkQgxikiRJhRjEJEmSCjGISZIkFWIQkyRJKsQgJkmSVIhBTJIkqRCDmCRJUiEGMUmSpEIMYpIkSYUYxCRJkgoxiEmSJBViEJMkSSrEICZJklSIQUySJKkQg5gkSVIhBjFJkqRCDGKSJEmFGMQkSZIKMYhJkiQV0lQQi4iDI+L2iHisej9oiHpPRsSGiLgvItaPtr0kSdK+qNknYsuAdZk5G1hXrQ/lXZl5bGZ27mZ7SZKkfUqzQWwhsLpaXg28fy+3lyRJalvNBrHpmbkVoHo/ZIh6CdwWEX0RsWg32kuSJO1zJo5UISLuAN4ySNElozjOiZm5JSIOAW6PiEcy865RtKcKcIsAZs6cOZqmkiRJY9KIQSwzTx2qLCKeiYgZmbk1ImYAzw6xjy3V+7MRcRNwHHAX0FD7qu0qYBVAZ2dnjtRvSZKksa7Zqcke4Lxq+Tzg5oEVIuLAiHjTjmXgPcCDjbaXJEnaVzUbxK4ETouIx4DTqnUi4q0RcWtVZzrwg4i4H/i/wC2Z+W/DtZckSRoPRpyaHE5mPg+cMsj2LcD8avkJ4M9G016SJGk88Jf1JUmSCjGISZIkFWIQkyRJKsQgJkmSVIhBTJIkqRCDmCRJUiEGMUmSpEIMYpIkSYUYxCRJkgoxiEmSJBViEJMkSSrEICZJklSIQUySJKkQg5gkSVIhBjFJkqRCDGKSJEmFGMQkSZIKMYhJkiQVYhCTJEkqxCAmSZJUiEFMkiSpEIOYJElSIU0FsYg4OCJuj4jHqveDBqnzRxFxX93rhYi4qCq7LCJ+Vlc2v5n+SJIktZNmn4gtA9Zl5mxgXbX+Opn5aGYem5nHAh3AS8BNdVVW7ijPzFub7I8kSVLbaDaILQRWV8urgfePUP8U4PHMfKrJ40qSJLW9ZoPY9MzcClC9HzJC/bOArw/YdkFEPBAR1w42tSlJkrSvGjGIRcQdEfHgIK+FozlQROwPnA78n7rNXwXeBhwLbAWuHqb9oohYHxHrn3vuudEcWpIkaUyaOFKFzDx1qLKIeCYiZmTm1oiYATw7zK7mAT/OzGfq9r1zOSL+Cfj2MP1YBawC6OzszJH6LUmSNNY1OzXZA5xXLZ8H3DxM3bMZMC1ZhbcdPgA82GR/JEmS2kazQexK4LSIeAw4rVonIt4aETu/ARkRk6rybw5o/8WI2BARDwDvApY02R9JkqS2MeLU5HAy83lq34QcuH0LML9u/SVgyiD1PtLM8SVJktqZv6wvSZJUiEFMkiSpEIOYJElSIQYxSZKkQgxikiRJhRjEJEmSCjGISZIkFWIQkyRJKsQgJkmSVIhBTJIkqRCDmCRJUiEGMUmSpEIMYpIkSYUYxCRJkgoxiEmSJBViEJMkSSrEICZJklSIQUySJKkQg5gkSVIhBrFh9Pf3s2bNmlG1efLJJ+np6dm5ftFFF3HCCSdwwgkncOWVV7a6i5IkqQmtuNevWLGCk046iRNPPJFzzz2X3/72tw3vyyA2jFacnE996lPcc8893H333dx88808/vjjre6mJEnaTa24119wwQXcdddd/PCHPwTgtttua3hfE0d15PEgEyKAWsLt6+ujq6uLj33sY3zjG9/g5Zdf5oADDuC6667jwAMP5IMf/CAvvfQSEcGqVatYsWIF9957L11dXVx99dV0dHQA8IY3vIEJEyYwYcKEkqOTJEmw837fynt9ZrJ9+3ZmzZo1mn7kbr+ADwMPAduBzmHqzQUeBTYBy+q2HwzcDjxWvR/UyHE7Ojpyj1i+PHPx4szt2zMz86dPPJGnHH545vLleeaZZ+a///u/Z2bmt771rfzMZz6TfX19efbZZ+9s/tprr+V3v/vd/PjHP77LrtesWZPnnnvunum3JElqXN39/qc//WmecsopmYsX55nHHLPb9/orrrgiZ82alfPmzcvf/OY3rysD1ucQmabZqckHgTOAu4aqEBETgK8A84Cj
gbMj4uiqeBmwLjNnA+uq9TIyob8furthyZLa+uc/D08/Df39bNiwgWXLltHV1cWXvvQlfvGLXzBnzhw6Ojo455xzWLx4MS+88MKgu77jjjtYvXo111xzzd4dkyRJer3B7vcbN0J3Nxu2bt3te/0ll1zCxo0bOfLII7nuuusa7k5TU5OZ+TBAVFN5QzgO2JSZT1R1bwAWAj+p3ruqequBXuBvmunTbouAlStry93d0N3N/sC2Qw+FlSs55swzufjii5kzZw4Ar776Kq+88gpLly4lIrjiiiu4/vrr6ejoYNu2bTt3+6Mf/YhLL72UtWvXcsABBxQYmCRJ2mnA/X7/7m62ASxezDFbtuzWvf6//uu/eOMb30hE8OY3v5lJkyY13p3aE7NmxxS9wF9n5vpByj4EzM3MT1TrHwGOz8wLIqI/MyfX1f1lZh40xDEWAYsAZs6c2fHUU0813e9BZcIbag8KtwML5s5l0qRJnH766dx44428+OKLAJx//vkcffTRXHjhhUycOJHt27ezevVqpk6dyvz585k+fTrLly/n7LPPBmDq1KkAr5tLliRJhVT3++3AAmDSGWfs9r3+mmuu4aGHHtr5+bB//Md/ZL/99tt5qIjoy8zOwboxYhCLiDuAtwxSdElm3lzV6WXoIPZh4L0Dgthxmfnp0QSxep2dnbl+/S6Hal5m7TFld/fvti1eXEvOwz/1kyRJ7WIv3++HC2IjfkYsM0/NzD8e5HVzg8ffDBxet34YsKVafiYiZlSdnAE82+A+W6/+pCxeDNu3197r55AlSVJ7G2P3+73x8xX3ArMj4kjgZ8BZwF9UZT3AecCV1Xuj4a71ImDy5Ncn4h1zyJMn+0RMkqR9wRi73zf1GbGI+ADwv4BpQD9wX2a+NyLeCnwtM+dX9eYDfw9MAK7NzC9U26cA3wBmAv8P+HBm/udIx91jU5Pwut8RG3RdkiS1v714v2/qM2Jj0R4NYpIkSS3U1GfEJEmStGcYxCRJkgoxiEmSJBViEJMkSSrEICZJklSIQUySJKkQg5gkSVIhbfk7YhHxHLCH/tXvnaYCv9jDxxjLxvP4x/PYYXyP37GPX+N5/ON57LB3xv/7mTltsIK2DGJ7Q0SsH+rH18aD8Tz+8Tx2GN/jd+zjc+wwvsc/nscO5cfv1KQkSVIhBjFJkqRCDGJDW1W6A4WN5/GP57HD+B6/Yx+/xvP4x/PYofD4/YyYJElSIT4RkyRJKmRcB7GI+HBEPBQR2yNiyG9MRMTciHg0IjZFxLK67QdHxO0R8Vj1ftDe6XnzGul7RPxRRNxX93ohIi6qyi6LiJ/Vlc3f64NoQqPnLiKejIgN1RjXj7b9WNTguT88Ir4bEQ9X18jiurK2O/dDXcN15RERX67KH4iItzfath00MP6/rMb9QETcHRF/Vlc26DXQLhoYe1dE/Kru7/PfNtq2HTQw/s/Wjf3BiHgtIg6uytr93F8bEc9GxINDlI+N6z4zx+0LOAr4I6AX6ByizgTgceAPgP2B+4Gjq7IvAsuq5WXAVaXHNIqxj6rv1Z/Dz6n9FgrAZcBflx7Hnh4/8CQwtdk/v7H0aqTvwAzg7dXym4CNdX/v2+rcD3cN19WZD6wFAjgB+FGjbcf6q8HxvwM4qFqet2P81fqg10A7vBocexfw7d1pO9Zfox0D8D7gzn3h3Ff9Pwl4O/DgEOVj4rof10/EMvPhzHx0hGrHAZsy84nMfBW4AVhYlS0EVlfLq4H375GO7hmj7fspwOOZuad/SHdvafbc7dPnPjO3ZuaPq+VfAw8Dh+6tDrbYcNfwDguBNVlzDzA5ImY02HasG3EMmXl3Zv6yWr0HOGwv93FPaeb8jYtzP8DZwNf3Ss/2gsy8C/jPYaqMiet+XAexBh0KPF23vpnf3ZCmZ+ZWqN24gEP2ct+aMdq+n8WuF+gF1ePca9tpaq7S6PgTuC0i+iJi0W60H4tG1feIOAKYA/yobnM7nfvhruGR6jTSdqwb
7Rg+Tu0pwQ5DXQPtoNGx/3lE3B8RayPimFG2HcsaHkNETALmAjfWbW7nc9+IMXHdT9xTOx4rIuIO4C2DFF2SmTc3sotBtrXFV02HG/so97M/cDpwcd3mrwKfp/Zn8XngauD83evpntGi8Z+YmVsi4hDg9oh4pPq/rDGthef+96j9h/mizHyh2jzmz/0AjVzDQ9Vp2+u/TsNjiIh3UQti76zb3JbXQKWRsf+Y2kcuXqw+7/gtYHaDbce60YzhfcAPM7P+CVI7n/tGjInrfp8PYpl5apO72AwcXrd+GLClWn4mImZk5tbqceazTR6rpYYbe0SMpu/zgB9n5jN1+965HBH/BHy7FX1upVaMPzO3VO/PRsRN1B5Z38U4OPcRsR+1EPYvmfnNun2P+XM/wHDX8Eh19m+g7VjXyPiJiD8FvgbMy8znd2wf5hpoByOOve5/MMjMWyPiHyJiaiNt28BoxrDLrEebn/tGjInr3qnJkd0LzI6II6snQ2cBPVVZD3BetXwe0MgTtrFiNH3f5XMD1Q18hw8Ag34rZQwbcfwRcWBEvGnHMvAefjfOffrcR0QA/ww8nJkrBpS127kf7hreoQc4t/oW1QnAr6pp20bajnUjjiEiZgLfBD6SmRvrtg93DbSDRsb+lurvOxFxHLX74vONtG0DDY0hIt4MnEzdfwv2gXPfiLFx3e+pbwG0w4vaTWQz8ArwDPCdavtbgVvr6s2n9q2xx6lNae7YPgVYBzxWvR9cekyjGPugfR9k7JOo/UfpzQPaXw9sAB6o/oLOKD2mVo+f2jdm7q9eD42nc09taiqr83tf9Zrfrud+sGsY+CTwyWo5gK9U5Ruo+xb1UNd/O70aGP/XgF/Wnev11fYhr4F2eTUw9guqsd1P7YsK7xhP575a/yhww4B2+8K5/zqwFfgttXv9x8fide8v60uSJBXi1KQkSVIhBjFJkqRCDGKSJEmFGMQkSZIKMYhJkiQVYhCTJEkqxCAmSZJUiEFMkiSpkP8Pinu9TLhZx0QAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 720x360 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "# ---------------------\n",
    "# Run this sanity check\n",
    "# Note that this is not an exhaustive check for correctness.\n",
    "# The plot produced should look like the \"test solution plot\" depicted below. \n",
    "# ---------------------\n",
    "\n",
    "print (\"-\" * 80)\n",
    "print (\"Outputted Plot:\")\n",
    "\n",
    "M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])\n",
    "word2ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}\n",
    "words = ['test1', 'test2', 'test3', 'test4', 'test5']\n",
    "plot_embeddings(M_reduced_plot_test, word2ind_plot_test, words)\n",
    "\n",
    "print (\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=red>**Test Plot Solution**</font>\n",
    "<br>\n",
    "<img src=\"imgs/test_plot.png\" width=40% style=\"float: left;\"> </img>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)\n",
    "\n",
    "Now we will put together all the parts you have written! We will compute the co-occurrence matrix with fixed window of 4 (the default window size), over the Reuters \"crude\" (oil) corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U\\*S, so we need to normalize the returned vectors, so that all the vectors will appear around the unit circle (therefore closeness is directional closeness). **Note**: The line of code below that does the normalizing uses the NumPy concept of *broadcasting*. If you don't know about broadcasting, check out\n",
    "[Computation on Arrays: Broadcasting by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html).\n",
    "\n",
    "Run the below cell to produce the plot. It'll probably take a few seconds to run. What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have?  **Note:** \"bpd\" stands for \"barrels per day\" and is a commonly used abbreviation in crude oil topic articles."
   ]
  },
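  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The normalization described above can be sketched as follows; the symbols $m_i$ and $\\hat{m}_i$ are notation introduced here for illustration and do not appear in the assignment code. Writing $m_i$ for the $i$-th row of the reduced matrix, each row is divided by its own Euclidean length:\n",
    "\n",
    "$$\\hat{m}_i = \\frac{m_i}{\\lVert m_i \\rVert_2}, \\qquad \\lVert \\hat{m}_i \\rVert_2 = 1.$$\n",
    "\n",
    "For instance, the row $(3, 4)$ has length $\\sqrt{3^2 + 4^2} = 5$ and becomes $(0.6, 0.8)$. In the cell below, `M_lengths[:, np.newaxis]` has shape $(n, 1)$, so broadcasting divides every entry of row $i$ by the same scalar $\\lVert m_i \\rVert_2$. This assumes no row is all zeros; such a row would lead to a division by zero."
   ]
  },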
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running Truncated SVD over 8185 words...\n",
      "Done.\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmgAAAEvCAYAAADxWj0AAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAArBUlEQVR4nO3de3hV1Z3H//eXm6AFoqKiIFK8gMo4yC+iUtoigtcpeClW61ikInZqNaU+bbG2mnpp7fxG09h27FgvOIO2o512wMvPFtFoHWwVFBUrqGOjBhFQAXVUFLJ+f5yTGGICCSeXneT9ep7znLPPXmvvtbMT/bDW2ntHSglJkiRlR7f2boAkSZK2ZECTJEnKGAOaJElSxhjQJEmSMsaAJkmSlDEGNEmSpIzp0d4N2B4DBgxIQ4cObe9mSJIkbdOSJUveSCnt1pw6HTKgDR06lMWLF7d3MyRJkrYpIl5ubh2HOCVJkjLGgCZJkpQxBjRJkqSMMaBJkiRljAFNkiRl3pw5c3j77bebXL6yspKJEye2YotalwFNkiRlXmMBbfPmze3QmtZnQJMkSW0npdqPlZWVHHbYYZx55pkUFxdTXl7Ohg0bOO200zj66KOZMGECL774Ig888ABLly5l6tSpXHDBBbX1zjrrLM4991z+/Oc/M3bsWMaNG8c//dM/kersA+DVV1/lxBNPZMKECZx44omsXbsWgP3226+2zMSJE6msrKSyspIxY8Ywffp0Ro4cyW233ca0adMYPXo0P/7xj9vmZ0QHvQ+aJEnqgEpLYf16KCuDCEiJymef5YGJE+k9Zw6HHXYYTz31FKeccgqnn346Tz31FLNnz+a3v/0to0aNYu7cuQwePLg2SC1cuJB+/fpRXFzMHXfcwbBhw/jqV7/KXXfdxSGHHFK7229/+9v84Ac/4IgjjmDevHn85Cc/4V/+5V8abebKlSt5+OGHWb9+PUOHDqWyspIBAwYwfPhwLr744tb/OWFAkyRJbSGlXDgrL88tl5XBFVcw4v336fv++9CjByNHjmTVqlWUl5fzy1/+EoAePRqOKiNHjqRfv34AbNiwgWHDhgEwduxYli9fvkVAe+aZZ5g9ezYAmzZt2qLn7OPmfdzrNmLECHr37s3AgQMZNGgQAwcOBKBPnz5s3ryZ7t27F/azaAIDmiRJan0RuVAGuZCWD2rL+/Th3SuuoPfmzSxbtozRo0czc+ZMTj75ZAA+/PBDAHr16sWmTZtqN1c3JPXv35+XXnqJYcOGsWjRIqZMmbLFrg8++GAuvvhiDj300C22WV1dzcaNG9m8eTPPPfdcnaZGg5+BTwyfthbnoEmSpLZRN6TlDT34YM6dOZMjjjiCadOmUVZWxh133MGECRM46qij+NnPfgbAKaecwjnnnMMPfvCDT2z2uuuu48wzz2TcuHH07NmTyZMnb7H+mmuu4bLLLmPChAlMmDCBO+64A4BvfOMbHHHEEZx//vkMHjy4lQ56+0RbJcGWVFxcnHwWpyRJHUxKMGtWbe9ZJTBj7725/+WXc+Gtk4qIJSml4ubUsQdNkiS1vrrhrKQEqqth+nR49dXc9x2ww6g1OQdNkiS1vggoKsqFs/xVnENvuon7+/XLfd+Je9C2h0OckiSp7aS0ZRirv9wJOcQpSZKyrX4Y6+ThbHsZ0CRJkjLGgCZJkpQxBjRJkqSMMaBJkqQO6/XXX+eiiy5qUtkZM2ZQUVHRrO3/93//N6+88sp2tKwwBjRJktRhDRw4kGuuuabVtt9YQNu8eXOr7RMMaJIkqaOpc4uwyspKJk6cSGlpKeeccw6TJ09m1KhRLF++HIA777yTUaNGceqpp/Lqq69uUadGzcPTKyoqGDNmDEcddRTTp0/nr3/9K/fddx8XXHABU6dOBWCfffbh61//OlOmTOFLX/oSTz75JAAvv/wykyZNarFD9Ea1kiSp4ygthfXra292S0rw/POwaRN9R43ipptu
4vbbb+fGG2/kJz/5CZdccglLliyhd+/e/P3f//1WN/273/2OK6+8kmOOOYbq6mq6devGcccdx4wZMxg3bhwAq1atYvbs2QwZMoSFCxdy00038fOf/5xbbrmFc845p8UO0x40SZLUMaSUC2fl5R8/HuqKK3KPi/rgA/6f0aMBGDJkCG+++SZvvPEGe+yxB3379qVnz56Mzq+Pevdeq7lp/7e//W3mz5/PmWeeyS233NJgEwYNGsSQIUMAmDBhAo899hjvvfced911FyeffHKLHWqL9KBFxHFAOdAduDGldHW99ZFffwLwHnB2SumJ/LpK4B1gM7CpuXfalSRJXURErucMciEt/9B19t4bjj2W6PZxv1NKiQEDBrB69WreffddevfuzdKlSwHYeeedee2110gpsXr1alauXAnArrvuys9//nNSShxwwAFMnTqVXr16sWnTptrtdu/evU5zglNPPZWvf/3rfO5zn2OHHXZosUMtOKBFRHfgF8AkoAp4PCLmp5T+WqfY8cD++dfhwPX59xpHpZTeKLQtkiSpk6sJaTXhDOCAAxp8IkH37t25/PLLGTduHJ/+9KcZNGgQAP369eO4447jyCOPZMyYMeyxxx4AXHvttfzxj3+kurqaSZMm0a9fP/7hH/6BSy+9lAMPPJB/+7d/+8Q+pk+fzuDBg2vnorXYYRb6LM6IOBIoTSkdm1++GCCl9OM6Zf4NqEgp/Tq/vAIYn1Jale9BK25OQPNZnJIkdVEp5YY36wa0Og9gb2urV6/mjDPO4IEHHmi0THs9i3MQ8Gqd5ar8d00tk4A/RsSSiJjZAu2RJEmdUd1wVlIC1dW597pz0trQggULmDx5Mt///vdbfNstMQetobha/ye0tTKfSSm9FhG7AwsiYnlK6eFP7CQX3mYCtZPzJElSFxIBRUVb9pjVzEkrKmrzHrRJkya16K016mqJgFYF7F1neTDwWlPLpJRq3tdExO+BMcAnAlpK6QbgBsgNcbZAuyVJUkdTWprrKasJYzUhrR2GN1tTSwxxPg7sHxGfjohewOnA/Hpl5gNfiZwjgA35+Wc7RURfgIjYCTgGWNYCbZIkSZ1V/TDWycIZtEAPWkppU0R8A/gDudts3JxSejYivpZf/0vgXnK32HiR3G02puer7wH8Pn8/kh7A7Sml+wptkyRJUkdW8FWc7cGrOCVJUkfRXldxSpIkqQUZ0CRJkjLGgCZJkpQxBjRJkqSMMaBJkiRljAFNkiQpYwxokiRJGWNAkyRJyhgDmiRJUsYY0CRJkjLGgCZJkpQxBjRJkqSMMaBJkiRljAFNkiQpYwxokiRJGWNAkyRJyhgDmiRJUsYY0CRJkjLGgCZJkpQxBjRJkqSMMaBJkiRljAFNkiQpYwxokiRJGWNAkyRJypgWCWgRcVxErIiIFyNidgPrIyKuy69/OiJGN7WuJElSV1NwQIuI7sAvgOOBg4AzIuKgesWOB/bPv2YC1zejriRJUpfSEj1oY4AXU0ovpZQ+BH4DTKlXZgrw7ynnz0BRROzZxLqSJEldSksEtEHAq3WWq/LfNaVMU+pKkiR1KS0R0KKB71ITyzSlbm4DETMjYnFELF67dm0zmyhJktS6Nm/e3GLb6tEC26gC9q6zPBh4rYllejWhLgAppRuAGwCKi4sbDHGSJEnNdfHFF7No0SI+/PBDLrnkEhYvXsyrr77K2rVreeWVV/jNb37DiBEjeOihh7j00kuJCEaMGMH111/Pyy+/zNSpUxkxYgQ9e/bkoosuYvr06ey2227suuuuDBs2DKB3RPx3SukkgIi4GbglpfSnxtrUEj1ojwP7R8SnI6IXcDowv16Z+cBX8ldzHgFsSCmtamJdSZKklpVyfT333Xcf69at46GKChYuXMgll1xCSom+ffsyf/58vvOd73DjjTeSUuKb3/wm8+fPp6Kigj59+nDPPfcAUFlZyS9+8QtuvvlmLr74Yq677jruuecedthhh5q9fQD0i4iB
EfEp4O+2Fs6gBXrQUkqbIuIbwB+A7sDNKaVnI+Jr+fW/BO4FTgBeBN4Dpm+tbqFtkiRJalRpKaxfD2VlPPPMMzz00EOM33tv6NGDjb178+abb3L44YcDMGTIEBYsWMAbb7xBZWUlU6bkrmV89913GT58OCNHjmTkyJH069cPgBdffJHDDjsMgMMPP5yqqqqavd4CnA2sAf5zW01siSFOUkr3kgthdb/7ZZ3PCTi/qXUlSZJaRUq5cFZeDsDBkyZxTO/elC9fDiUlfPiTn/CjH/+YiKhTJTFgwACGDRvG3Xffzac+9SkAPvroI1auXEn37t1ry+67774sXryYww8/nMcff5w999yzZtWdwEPkOqpO21YzWySgSZIkdQgRUFaW+1xezgnl5TwKjB80iHjqKQbPmMG+++7bQLXg2muvZfLkyaSU6NatG2VlZbU9ZzV+9KMf8dWvfpUBAwbQv39/9tlnHwBSSh9ExJ+BvVJK27zaMVLqePPti4uL0+LFi9u7GZIkqaNKCbrVmYpfXZ0LbwX66KOP6NmzJwDnnnsuxx57LFOnTl2SUiqOiJ8C96SUFmxrOz6LU5IkdS0pwaxZW343a1bthQOFeOaZZ/jsZz/LkUceybvvvstJJ50EQETcCgxpSjgDhzglSVJXUhPOysuhpCQ33FmzDLnlAnrSRo8ezZ/+9MkLNFNK05qzHQOaJEnqOiKgqOjjcFZ3TlpRUYsMc7YE56BJkqSuJ6Utw1j95RYUEUtSSsXNqeMcNEmS1PXUD2MZ6TmrYUCTJEnKGAOaJElSxhjQJEmSMsaAJkmSlDEGNEmSpIwxoEmSJGWMAU2SJCljDGiSJEkZY0CTJEnKGAOaJElSxhjQJEmSMsaAJkmSlDEGNEmSpIwxoEmSJGWMAU2SJHVaV199Nc888wwA++23Xzu3pul6tHcDJEmSWsvs2bPbuwnbxR40SZLUKaSUOO+88xg3bhxjx47lscce4+yzz+aRRx5p76Y1mz1okiSp40oJIgCYN28eH334IY888ggvvfQSp59+OgcddFA7N3D7FNSDFhG7RMSCiHgh/75zI+WOi4gVEfFiRMyu831pRKyMiKX51wmFtEeSJHUhpaUwa1YupAErli9n7KuvQmkpw4YNY926de3bvgIUOsQ5G1iYUtofWJhf3kJEdAd+ARwPHAScERF142xZSmlU/nVvge2RJEldQUqwfj2Ul9eGtOH/8z8sWrgQ1q/npf/9X4qKitq7ldut0IA2Bbg1//lW4KQGyowBXkwpvZRS+hD4Tb6eJEnS9omAsjIoKcmFtG7dmHz33XQ/+GDGLV7Mmf/4j/zsZz9r71Zut0j5bsHtqhyxPqVUVGd5XUpp53plvggcl1KakV8+Czg8pfSNiCgFzgbeBhYDF6WUttkfWVxcnBYvXrzd7ZYkSZ1EStCtTn9TdXXtnLSsiIglKaXi5tTZZg9aRNwfEcsaeDW1F6yhn1JNKrwe2BcYBawCrtlKO2ZGxOKIWLx27dom7lqSJHVaKeWGN+uqMyetI9tmQEspTUwpjWzgNQ9YHRF7AuTf1zSwiSpg7zrLg4HX8ttenVLanFKqBn5Fbji0sXbckFIqTikV77bbbk0/QkmS1PnUhLPy8twwZ3X1x8OdnSCkFXqbjfnANODq/Pu8Bso8DuwfEZ8GVgKnA1+GXKhLKa3KlzsZWFZgeyRJUlcQAUVFuVBWVvbxnDTIfZ+xYc7mKnQO2q7AHcAQ4BVgakrprYjYC7gxpXRCvtwJwE+B7sDNKaWr8t//B7nhzQRUAufVCWyNcg6aJEkCtrgPWoPLGbA9c9AKCmjtxYAmSZI6ila5SECSJElty4AmSZKUMQY0SZKkjDGgSZIkZYwBTZIkKWMMaJIkSRljQJMkScoYA5okSVLGGNAkSZIyxoAmSZJa3euvv85FF13U3s3oMAxokiSp1Q0cOJBrrrlmi+82b97cTq3Jvh7t3QBJktT5VVZWMmPGDMaNG0dlZSVv
vfUWZ5xxBkuXLuWxxx5jw4YNfO1rX2PmzJm8++67fOlLX2Ljxo0ccsghPPHEE1RUVLT3IbQpA5okSWodKUHEJ77eYYcdmD9/PgCTJ09mp512YuPGjfzd3/0d06dP51e/+hXjxo3j4osv5rbbbuOJJ55o65a3O4c4JUlSyysthVmzciENcu/PPw8VFYwdO7a22PXXX8+4ceM45phjWLNmDWvWrOH5559nzJgxABx++OHt0Pj2Z0CTJEktKyVYvx7Kyz8OaVdcAa++Ch98QPduufixbt06br75Zh566CH+8Ic/0L9/f1JK7L///ixevBiAxx9/vB0PpP04xClJklpWBJSV5T6Xl+deAHvvDcceWzvsWVRUxMEHH8y4ceM48MAD2XXXXQE499xzOe2001iwYAEjR45sjyNod5Fquh47kOLi4lSTrCVJUkalBN3qDNZVVzc4J60xVVVVTJ06lR122KFDXyQQEUtSSsXNqeMQpyRJankp5YY366o7J62FdbZbdjjEKUmSttvFF1/MokWL+PDDD7nkkks45JBDmDlzJu8vW0aPlStZcOGFfHXDBmb83/8xrrycucuX8+Lhh1P6wx/y3e9+d5u32KjpPXv++eeZOXMmKSUGDhzInDlz6NOnD/vssw8nnngir7zyCnfffXd7/zhajAFNkiQ1XZ1bZ9x3332se+stHnroId577z2OPPJIDjjgAL71rW9xzKJFVK9bR7ef/hSmT4cLL4RBg+CVV2rrX3rppU2+xcZ3vvMdLr/8cj73uc9x+eWX86tf/YoLL7yQVatWMXv2bIYMGdJeP5FW4RCnJElqmnq3znjm6ad56M47GT90KCeccAIbN27kr3/9K0cddRSUlubCWQQRUXvhQDr55NrNNecWG88//3zt7TnGjh3L8uXLARg0aFCnC2dgQJMkSU3RwK0zDv6f/+GYdeuoOOkkKh58kKeffpqDDz64dkJ/dT7I7bLLLlRVVUEES5YsAZp/i40DDjiARYsWAbBo0SKGDx8OQPfu3dvoB9C2vIpTkiQ1Tc3E/5rbZgA/GDOGP/XpQ0QwePBgrrrqKs4991w++OADevbsyR//+EdWrFjBGWecwZAhQxgwYABDhgzhsssu47TTTqOqqooDDzyQpUuXMn/+fPr3789pp53GRx99xMiRI1m6dCkVFRUsX76c8847j5QSu+++O//xH/9Bnz592G+//XjxxRfb8YeybdtzFacBTZIkNV2Bt87oirzNhiRJaj1tfOuMrqyggBYRu0TEgoh4If++cyPlbo6INRGxbHvqS5KkdlZ3eLOkJNdzVlKy5eOc1GIK7UGbDSxMKe0PLMwvN2QOcFwB9SVJUnuKgKKiXCgrK/v4cU4lJbnvHeZsUQXNQYuIFcD4lNKqiNgTqEgpDW+k7FDg7pTSyO2pX5dz0CRJaid17oPW4LI+oT3moO2RUloFkH/fvY3rS5KktlQ/jBnOWsU2nyQQEfcDAxtYdUnLN2er7ZgJzAQ65Q3pJEmSamwzoKWUJja2LiJWR8SedYYo1zRz/02un1K6AbgBckOczdyPJElSh1HoEOd8YFr+8zRgXhvXlyRJ6nQKDWhXA5Mi4gVgUn6ZiNgrIu6tKRQRvwYeBYZHRFVEnLO1+pIkqfkqKyuZOLHRga+CzJkzhwULFgBw3XXXtco+9DGfJCBJUidRWVnJjBkzuP/++1t1Px3h8UpZ4pMEJEkSAD//+c858MADmT59eu13++23HwBHH300b731Fs888wy9evXinXfe4fHHH2fmzJkAHHvssYwfP54xY8bw6KOPAlBaWsrcuXO5/fbbWblyJePHj+eqq65q+wPrIrZ5kYAkScqwBu5D9r3vfY9evXpx/fXXM3fu3E9UGT9+PA8++CBVVVUcf/zxPPzwwyxbtoyjjjoKgN/97nfstNNOPPfcc5x//vk88MADtXW//OUvc+mll1JRUdGqh9XVGdAkSeqoSkth/fqP7+yfEs8++ihvPfss
f3755drer/qOPvpo5s6dyxtvvMFll13G3Llzee6557jlllt4//33KSkpYcWKFXTv3p2VK1e26SEpxyFOSZI6opRy4azuszCvuIKD33uPS0aP5rTTTmOXXXahqqoKgKVLl7Jp0yYAxowZw1/+8hc2btzI6NGjefbZZ3nzzTcZOHAg9913H927d+dPf/oT//qv/0pDc9V79OhBdXV1Wx5tl2MPmiRJHVHNszAhF9LKy3Of996bU+++m5533cX3vvc9+vbty+c//3k+//nP06NH7n/7PXr0YODAgYwaNQqAgQMHsv/++wNw5JFH8uMf/5iJEyfymc98psFdf/GLX+TEE0/k+OOP58ILL2zVw+yqvIpTkqSOLCXoVmdArLraxy9ljFdxSpLUlaSUG96sq2a4Ux2aAU2SpI6oJpyVl0NJSa7nrKRkyzlp6rCcgyZJUkcUAUVFuVBWcxVnzZy0oiKHOTs456BJktSR1b8PWgP3RVP7cg6aJEldTf0wZjjrFAxokiRJGWNAkyRJyhgDmiRJUsYY0CRJkjLGgCZJkpQxBjRJkqSMMaBJkiRljAFNkiQpYwxokiRJGWNAkyRJyhgDmiRJUsYY0CRJkjLGgCZJkpQxBjRJkqSMKSigRcQuEbEgIl7Iv+/cSLmbI2JNRCyr931pRKyMiKX51wmFtEeSJKkzKLQHbTawMKW0P7Awv9yQOcBxjawrSymNyr/uLbA9kiRJHV6hAW0KcGv+863ASQ0VSik9DLxV4L4kSZK6hEID2h4ppVUA+ffdt2Mb34iIp/PDoA0OkQJExMyIWBwRi9euXbu97ZUkScq8bQa0iLg/IpY18JrSAvu/HtgXGAWsAq5prGBK6YaUUnFKqXi33XZrgV1LkiRlU49tFUgpTWxsXUSsjog9U0qrImJPYE1zdp5SWl1nW78C7m5OfUmSpM6o0CHO+cC0/OdpwLzmVM6HuhonA8saKytJktRVFBrQrgYmRcQLwKT8MhGxV0TUXpEZEb8GHgWGR0RVRJyTX/XPEfFMRDwNHAXMKrA9kiRJHd42hzi3JqX0JnB0A9+/BpxQZ/mMRuqfVcj+JUmSOiOfJCBJkpQxBjRJkqSMMaBJkiRljAFNkiQpYwxokiRJGWNAkyRJyhgDmiRJUsYY0CRJkjLGgCZJapbrrrtuu+vOmTOHt99+uwVbI3VOBjRJUrMY0KTWZ0CTJJFS4rzzzmPcuHGMHTuWxx57jPHjx1NVVQXAlVdeyZw5c7j99ttZuXIl48eP56qrrqKiooJjjz2WU089lVGjRnHnnXcCcPbZZ/PII48AMHfuXEpLS3nggQdYunQpU6dO5YILLmi3Y5U6goKexSlJ6uBSggjmzZvHRx99xCN/+hMv/e1vnH766ey4446fKP7lL3+ZSy+9lIqKCgAqKipYuXIlTz75JO+//z7FxcWceuqpDe5qwoQJjBo1irlz5zJ48ODWPCqpwzOgSVJXVVoK69dDWRkrVqxg7JFHwqxZDCsqYt26dey00061RVNKjW7m0EMPpWfPnvTs2ZPdd9+dtWvXEhFNqiupYQ5xSlJXlFIunJWXw6xZDD/gABb99KdQXs5LL79MUVERu+yyS+0Q55IlS2qr9ujRg+rq6trlpUuXsmnTJt555x1Wr17NgAEDGq3bq1cvNm3a1CaHKHVk9qBJUlcUAWVluc/l5UwuL+ceYNyee7J5+XJ+9rOfsXHjRmbMmMEBBxzADjvsUFv1i1/8IieeeCLHH388hxxyCHvttRdTp07lb3/7G1deeSXdu3dnxowZnHHGGdx+++0MGDCAoqIiAE455RTOOeccxo4dyxVXXNH2xy11ENERu56Li4vT4sWL27sZktTxpQTd6gymVFfnwlsTVVRUMHfuXG688cZWaJzUOUTEkpRScXPqOMQpSV1VSjBr1pbfzZqV+15SuzKgSVJXVBPOysuhpCTXc1ZSUjsnrakhbfz48faeSa3AOWiS1BVFQFFRLpSVlW05J62o
qFnDnJJannPQJKkry98HrdFlSQVzDpokqXnqhzHDmZQJBjRJkqSMMaBJkiRljAFNkiQpYwoKaBGxS0QsiIgX8u87N1Bm74h4MCKei4hnI6KkOfUlSZK6mkJ70GYDC1NK+wML88v1bQIuSikdCBwBnB8RBzWjviRJUpdSaECbAtya/3wrcFL9AimlVSmlJ/Kf3wGeAwY1tb4kSVJXU2hA2yOltApyQQzYfWuFI2IocCjwl+2pL0mS1BVs80kCEXE/MLCBVZc0Z0cR8Sngv4BvppTebk7dfP2ZwEyAIUOGNLe6JElSh7HNgJZSmtjYuohYHRF7ppRWRcSewJpGyvUkF85uSyn9rs6qJtXPt+MG4AbIPUlgW+2WJEnqqAod4pwPTMt/ngbMq18gIgK4CXgupXRtc+tLkiR1NYUGtKuBSRHxAjApv0xE7BUR9+bLfAY4C5gQEUvzrxO2Vl+SJKkr2+YQ59aklN4Ejm7g+9eAE/KfHwEafLhbY/UlSZK6Mp8kICnzKisrmTix0emwmd22JG0vA5qkTqm6unqL5c2bN7dTSySp+Qoa4pSktrJhwwbOPPNMVqxYwVlnncUhhxzC5ZdfzqZNm9hll134z//8T3r37s1+++3HaaedxqOPPsq3v/1tysvL6devH/vuuy/HH388l156KRHBiBEjuP7667fYR1lZGb/5zW/YcccdOemkkygpKWmkNZLUugxokrIrJYjcFNbKykoeWLiQ3n36cNhhhzFv3jwefPBBAL773e9yxx138JWvfIVNmzbxhS98gR/96EdUVFTw2muvcffdd9OjRw9Gjx5NRUUF/fv3Z9asWdxzzz2MHDmydne33XYbDz74IH379v1ED5wktSUDmqRsKi2F9euhrAyAESNG0PfSS6GoiJEjR/L6669z7rnnsnHjRlavXk2/fv0A6N69O0cccUTtZoqLi+nZsydr166lsrKSKVOmAPDuu+8yfPjwLQLaT3/6Uy688EI2bdrEeeedx7hx49rscCWpLgOapOxJKRfOystzyyUlLF+yhHcfeYTeF1zAsmXLKC0t5Yc//CFHHnkk3/nOd0gpd//qiCDi4wvHu3fvDsCAAQMYNmwYd999N5/61KcA+Oijj1i5cmVt2dGjRzNu3DiqqqqYMmUKS5YsaZvjlaR6DGiSsieitueM8nIoL2cocO4BB/DCokVMmzaNgQMHcs455zB8+HD69+9f24PW+CaDa6+9lsmTJ5NSolu3bpSVlW1R76yzzuKNN97ggw8+4Pzzz2+945OkbYiaf3V2JMXFxWnx4sXt3QxJrS0l6FbnYvPq6to5aZLUUUTEkpRScXPqeJsNSdmUEsyateV3s2blvpekTs6AJil7asJZeTmUlOR6zkpKcsuGNEldgHPQJGVPBBQV5UJZWdmWc9KKihzmlNTpOQdNUnbVuQ9ag8uS1AE4B01S51I/jBnOJHURBjRJkqSMMaBJkiRljAFNkiQpYwxokiRJGWNAkyRJyhgDmiRJUsYY0CRJkjLGgCZJkpQxBjRJkqSMMaBJkiRljAFNkiQpYwxokiRJGVNQQIuIXSJiQUS8kH/fuYEye0fEgxHxXEQ8GxElddaVRsTKiFiaf51QSHskSZI6g0J70GYDC1NK+wML88v1bQIuSikdCBwBnB8RB9VZX5ZSGpV/3VtgeyRJkjq8QgPaFODW/OdbgZPqF0gprUopPZH//A7wHDCowP1KkiR1WoUGtD1SSqsgF8SA3bdWOCKGAocCf6nz9Tci4umIuLmhIVJJkqSuZpsBLSLuj4hlDbymNGdHEfEp4L+Ab6aU3s5/fT2wLzAKWAVcs5X6MyNicUQsXrt2bXN2LUmS1KH02FaBlNLExtZFxOqI2DOltCoi9gTWNFKuJ7lwdltK6Xd1tr26TplfAXdvpR03ADcAFBcXp221W5IkqaMqdIhzPjAt/3kaMK9+gYgI4CbguZTStfXW7Vln8WRgWYHtkSRJ6vAKDWhXA5Mi4gVg
Un6ZiNgrImquyPwMcBYwoYHbafxzRDwTEU8DRwGzCmyPJElSh7fNIc6tSSm9CRzdwPevASfkPz8CRCP1zypk/5IkSZ2RTxKQJEnKGAOaJElSxhjQJEmSMsaAJkmSlDEGNEmSpIwxoEmSJGWMAU2SJCljDGiSJEkZY0CTJEnKGAOaJElSxhjQJEmSMsaAJkmSlDEGNEmSpIwxoEmSJGWMAU2SJCljDGiSJEkZY0DbisrKSiZOnNjseldeeSVz5sxp+QZJkqQuwYAmSZKUMT3auwGZkxJE1C5u2LCBM888kxUrVnDWWWfRv39/7rnnHj744AOqqqq47rrr+OxnP8vDDz/MhRdeyJAhQ9hhhx0YPHhwOx6EJEnqyOxBq6u0FGbNyoU0gJSofPZZfjlkCI8++ii33HILa9as4Z133uGuu+7i97//PbNmzQLgW9/6FvPnz2fevHls2LCh/Y5BkiR1ePag1UgJ1q+H8vLcclkZXHEFI95/n77vvw89ejBy5EhSShx22GEADB06tDaMvf322wwZMgSAMWPGtMcRSJKkTsIetBoRuVBWUpILad26wS23sLxPH9694go2bd7MsmXLiAiWLFkCwCuvvEK/fv0A6Nu3L1VVVQA8/vjj7XYYkiSp47MHra6akFbTiwYMPfhgzp05kxdeeIFp06ax8847s+OOO3LiiSfy2muvUVZWBsA111zDF77wBfbaay/69u3bXkcgSZI6AQNaXSnl5qDlDQUe/8xncqEtf+HAnDlzGDVqFN///ve3qDp+/HiefPLJNmysJEnqrBzirFETzsrLc8Oc1dUfD3fWvXBAkiSplRXUgxYRuwD/Sa6zqRI4LaW0rl6Z3sDDwA75/f02pXRZU+u3mQgoKsqFspoes/zwJUVFtT1oZ599drs0T5IkdR2RCugZioh/Bt5KKV0dEbOBnVNK361XJoCdUkrvRkRP4BGgJKX056bUb0hxcXFavHjxdrd7q+rdB+0Ty5IkSc0QEUtSSsXNqVPoEOcU4Nb851uBk+oXSDnv5hd75l81qXCb9dtc/TBmOJMkSW2s0IC2R0ppFUD+ffeGCkVE94hYCqwBFqSU/tKc+pIkSV3JNuegRcT9wMAGVl3S1J2klDYDoyKiCPh9RIxMKS1rcitz7ZgJzARqbwgrSZLUGW0zoKWUJja2LiJWR8SeKaVVEbEnuR6yrW1rfURUAMcBy4Am108p3QDcALk5aNtqtyRJUkdV6BDnfGBa/vM0YF79AhGxW77njIjoA0wElje1viRJUldTaEC7GpgUES8Ak/LLRMReEXFvvsyewIMR8TTwOLk5aHdvrb4kSVJXVtB90FJKbwJHN/D9a8AJ+c9PA4c2p74kSVJX5pMEJEmSMqagG9W2l4hYC7zc3u1oJwOAN9q7EWpRntPOyfPa+XhOO6e2OK/7pJR2a06FDhnQurKIWNzcuxEr2zynnZPntfPxnHZOWT2vDnFKkiRljAFNkiQpYwxoHc8N7d0AtTjPaefkee18PKedUybPq3PQJEmSMsYeNEmSpIwxoGVERBwXESsi4sWImN3A+p0j4vcR8XREPBYRI+usK4qI30bE8oh4LiKObNvWqzHbe14jYnhELK3zejsivtnmB6BPKPBvdVZEPBsRyyLi1xHRu21br8YUeF5L8uf0Wf9OsyMibo6INRGxrJH1ERHX5c/50xExus66rf4+tImUkq92fgHdgf8FhgG9gKeAg+qV+X+By/KfRwAL66y7FZiR/9wLKGrvY/JV+Hmtt53Xyd1Hp92Pqyu/CjmnwCDgb0Cf/PIdwNntfUy+Cj6vI4FlwI7kns5zP7B/ex+TrwTwOWA0sKyR9ScA/x8QwBHAX5r6+9AWL3vQsmEM8GJK6aWU0ofAb4Ap9cocBCwESCktB4ZGxB4R0Y/cL+FN+XUfppTWt1nLtTXbfV7rlTka+N+UUle9OXOWFHpOewB9IqIHuf+hv9Y2zdY2FHJeDwT+nFJ6L6W0CXgIOLnt
mq7GpJQeBt7aSpEpwL+nnD8DRRGxJ037fWh1BrRsGAS8Wme5Kv9dXU8BpwBExBhgH2AwuYS/FrglIp6MiBsjYqfWb7KaoJDzWtfpwK9bqY1qnu0+pymllcC/AK8Aq4ANKaU/tnqL1RSF/K0uAz4XEbtGxI7kemX2bvUWqyU0dt6b8vvQ6gxo2RANfFf/8tqrgZ0jYilwAfAksIncv8hHA9enlA4F/g9on/Fy1VfIec1tIKIXMBm4s5XaqObZ7nMaETuT+1f4p4G9gJ0i4h9bsa1quu0+ryml54CfAAuA+8gFuU2oI2jsvDfl96HV9WjrHapBVWz5L67B1Bv6SCm9DUyH3MRGcnNZ/kZumKQqpfSXfNHfYkDLikLOa43jgSdSSqtbt6lqokLO6bHA31JKa/PrfgeMBea2frO1DQX9raaUbiI/zSQifpTfnrKvsfPeq5Hv25Q9aNnwOLB/RHw632NyOjC/boH8lZq98oszgIdTSm+nlF4HXo2I4fl1RwN/bauGa6u2+7zWKXIGDm9mSSHn9BXgiIjYMf8/+KOB59qw7WpcQX+rEbF7/n0IuWFQ/2Y7hvnAV/JXcx5BbtrBKprw+9AW7EHLgJTSpoj4BvAHcleP3JxSejYivpZf/0tyE1H/PSI2kwtg59TZxAXAbflfpJfI/ytP7avQ85qfzzIJOK/NG68GFXJOU0p/iYjfAk+QGwJ7kozewbyraYH/Bv9XROwKfAScn1Ja17ZHoIZExK+B8cCAiKgCLgN6Qu05vZfcnMEXgffI/7+zsd+HNm9//pJSSZIkZYRDnJIkSRljQJMkScoYA5okSVLGGNAkSZIyxoAmSZKUMQY0SZKkjDGgSZIkZYwBTZIkKWP+f5E7dHoeWrAIAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 720x360 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# -----------------------------\n",
    "# Run This Cell to Produce Your Plot\n",
    "# ------------------------------\n",
    "reuters_corpus = read_corpus()\n",
    "M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)\n",
    "M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)\n",
    "\n",
    "# Rescale (normalize) the rows to make them each of unit-length\n",
    "M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)\n",
    "M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting\n",
    "\n",
    "words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']\n",
    "\n",
    "plot_embeddings(M_normalized, word2ind_co_occurrence, words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
    "- From the plot above, there's three clusters that we can observe:\n",
    "    - petroleum, industry\n",
    "    - energy, oil\n",
    "    - ecuador, iraq, kuwait\n",
    "- Countries that are major exporters of oil have been clustered together. This makes sense as co-occurence matrices group general topics and since oil countries likely share similar co-occurence words in the reuter articles corpus, they are expected to cluster together here. \n",
    "- \"bpd\", \"barrels\" and \"output\" should cluster together, but apparently they have less co-occurance in given datasets. Also, petroleum and oil could be clustered more closely together as crude oil and petroleum are sometimes used synonymously."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Prediction-Based Word Vectors (15 points)\n",
    "\n",
    "As discussed in class, more recently prediction-based word vectors have demonstrated better performance, such as word2vec and GloVe (which also utilizes the benefit of counts). Here, we shall explore the embeddings produced by GloVe. Please revisit the class notes and lecture slides for more details on the word2vec and GloVe algorithms. If you're feeling adventurous, challenge yourself and try reading [GloVe's original paper](https://nlp.stanford.edu/pubs/glove.pdf).\n",
    "\n",
    "Then run the following cells to load the GloVe vectors into memory. **Note**: If this is your first time to run these cells, i.e. download the embedding model, it will take a couple minutes to run. If you've run these cells before, rerunning them will load the model without redownloading it, which will take about 1 to 2 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_embedding_model():\n",
    "    \"\"\" Load GloVe Vectors\n",
    "        Return:\n",
    "            wv_from_bin: All 400000 embeddings, each lengh 200\n",
    "    \"\"\"\n",
    "    import gensim.downloader as api\n",
    "    wv_from_bin = api.load(\"glove-wiki-gigaword-200\")\n",
    "    print(\"Loaded vocab size %i\" % len(wv_from_bin.vocab.keys()))\n",
    "    return wv_from_bin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded vocab size 400000\n"
     ]
    }
   ],
   "source": [
    "# -----------------------------------\n",
    "# Run Cell to Load Word Vectors\n",
    "# Note: This will take a couple minutes\n",
    "# -----------------------------------\n",
    "wv_from_bin = load_embedding_model()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Note: If you are receiving a \"reset by peer\" error, rerun the cell to restart the download. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reducing dimensionality of Word Embeddings\n",
    "Let's directly compare the GloVe embeddings to those of the co-occurrence matrix. In order to avoid running out of memory, we will work with a sample of 10000 GloVe vectors instead.\n",
    "Run the following cells to:\n",
    "\n",
    "1. Put 10000 Glove vectors into a matrix M\n",
    "2. Run `reduce_to_k_dim` (your Truncated SVD function) to reduce the vectors from 200-dimensional to 2-dimensional."
   ]
  },
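  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what step 2 computes, assuming `reduce_to_k_dim` wraps a rank-$k$ truncated SVD that returns $U \\cdot S$ (as noted in Question 1.5); the symbols below are notation introduced here, not names from the assignment code:\n",
    "\n",
    "$$M \\approx U_k \\Sigma_k V_k^\\top, \\qquad M_{\\text{reduced}} = U_k \\Sigma_k \\in \\mathbb{R}^{n \\times k}.$$\n",
    "\n",
    "Here $M \\in \\mathbb{R}^{n \\times 200}$ holds one GloVe vector per row, $n$ is the number of words placed in $M$, and $k = 2$. Each row of $M_{\\text{reduced}}$ gives a word's coordinates along the top two singular directions, which is what gets plotted."
   ]
  },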
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']):\n",
    "    \"\"\" Put the GloVe vectors into a matrix M.\n",
    "        Param:\n",
    "            wv_from_bin: KeyedVectors object; the 400000 GloVe vectors loaded from file\n",
    "        Return:\n",
    "            M: numpy matrix shape (num words, 200) containing the vectors\n",
    "            word2ind: dictionary mapping each word to its row number in M\n",
    "    \"\"\"\n",
    "    import random\n",
    "    words = list(wv_from_bin.vocab.keys())\n",
    "    print(\"Shuffling words ...\")\n",
    "    random.seed(224)\n",
    "    random.shuffle(words)\n",
    "    words = words[:10000]\n",
    "    print(\"Putting %i words into word2ind and matrix M...\" % len(words))\n",
    "    word2ind = {}\n",
    "    M = []\n",
    "    curInd = 0\n",
    "    for w in words:\n",
    "        try:\n",
    "            M.append(wv_from_bin.word_vec(w))\n",
    "            word2ind[w] = curInd\n",
    "            curInd += 1\n",
    "        except KeyError:\n",
    "            continue\n",
    "    for w in required_words:\n",
    "        if w in words:\n",
    "            continue\n",
    "        try:\n",
    "            M.append(wv_from_bin.word_vec(w))\n",
    "            word2ind[w] = curInd\n",
    "            curInd += 1\n",
    "        except KeyError:\n",
    "            continue\n",
    "    M = np.stack(M)\n",
    "    print(\"Done.\")\n",
    "    return M, word2ind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shuffling words ...\n",
      "Putting 10000 words into word2ind and matrix M...\n",
      "Done.\n",
      "Running Truncated SVD over 10010 words...\n",
      "Done.\n"
     ]
    }
   ],
   "source": [
    "# -----------------------------------------------------------------\n",
    "# Run Cell to Reduce 200-Dimensional Word Embeddings to k Dimensions\n",
    "# Note: This should be quick to run\n",
    "# -----------------------------------------------------------------\n",
    "M, word2ind = get_matrix_of_vectors(wv_from_bin)\n",
    "M_reduced = reduce_to_k_dim(M, k=2)\n",
    "\n",
    "# Rescale (normalize) the rows to make them each of unit-length\n",
    "M_lengths = np.linalg.norm(M_reduced, axis=1)\n",
    "M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis] # broadcasting"
   ]
  },
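  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a brief aside, the row normalization in the cell above can be written out as follows (a sketch in our own notation: $M_i$ denotes row $i$ of `M_reduced` and $\\hat{M}_i$ the corresponding row of `M_reduced_normalized`):\n",
    "\n",
    "$$\\hat{M}_i = \\frac{M_i}{||M_i||_2}, \\qquad ||M_i||_2 = \\sqrt{\\sum_j M_{ij}^2}$$\n",
    "\n",
    "For a hypothetical row $M_i = (3, 4)$ (made-up numbers, not taken from the actual embeddings), $||M_i||_2 = 5$ and $\\hat{M}_i = (0.6, 0.8)$, which has length $\\sqrt{0.36 + 0.64} = 1$. After this step, every word sits on the unit circle in the 2D plot, so only its direction, not its magnitude, affects where it is drawn."
   ]
  },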
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note: If you are receiving out of memory issues on your local machine, try closing other applications to free more memory on your device. You may want to try restarting your machine so that you can free up extra memory. Then immediately run the jupyter notebook and see if you can load the word vectors properly. If you still have problems with loading the embeddings onto your local machine after this, please go to office hours or contact course staff.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.1: GloVe Plot Analysis [written] (3 points)\n",
    "\n",
    "Run the cell below to plot the 2D GloVe embeddings for `['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']`.\n",
    "\n",
    "What clusters together in 2-dimensional embedding space? What doesn't cluster together that you think should have? How is the plot different from the one generated earlier from the co-occurrence matrix? What is a possible cause for the difference?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmoAAAEvCAYAAAD1r+09AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAk80lEQVR4nO3deZQV1b238edHA1EJ0CpRVOQiAQcgSnhbFMRXFHAiEWOCcYgBRdH7+oYOZpmoydW+UaNZK4qNSczyehXvS4wZbgw4xDi2Q5wARcAg6tVGGlQgDBGDCPZ+/zin2wYbaTg9VDfPZ61e51TVPrV31QH6y967qiKlhCRJkrKnXUs3QJIkSfUzqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRrVv6QZ8lm7duqVevXq1dDMkSZK2ac6cOStTSl9ozH1mOqj16tWL2bNnt3QzJEmStikiFjf2Ph36lCRJyiiDmiRJUkYZ1CRJkjLKoCZJkpRRBjVJktTmvfvuu3zve99rUNnzzz+fioqK7dr/n/70J4CO29uubTGoSZKkNq979+7ccMMNjbKvjz/++FPrthbUIqKokLoipVTI55tUSUlJ8vYckiRph6QEEQBUVlYyfPhwUkqsX7+eAw44gMWLF3PUUUexceNG5s2bx6677kq/fv1YsmQJGzZsYNddd6WqqoolS5awePFiDj74YMaOHcvKlSupqqpi8eLFdOnShUGDBvHYY4/xwQcffAisSintl79Vxz+B1cAS4PqU0ksR8S/AbSmlUQ05BHvUJElS21NWBt/9bi6sAU9UVLBp5UrOPeAAxo4dy4cffsjQoUOprKzknnvuYcOGDYwaNYrf/OY3zJs3j2uvvZa7776bdu3acf/99wOwceNGfvGLX9C3b186derEQw89xNtvv83ee+9Nnz59ABYDiyKiO7AP8FFKaShwKzAh37Jzgf9s6GFk+oa3kiRJ2y0lePBBeP753PJNN/Hqz3/O6g8+YNqzz7Lre+9RM6LYqVMnVq5cyV577cX777/P2rVrqa6u5kc/+hG77LILa9asoaqqigEDBtCxY0e6dOnCpZdeymGHHcbUqVMZMWIERxxxBLNmzaqp/Q5gPLAW+H/5dY8B10XEbsBXgesaeij2qEmSpLbniCNyr1OnQrt2HDhnDnsC4wcO5IdXXMG8efMoLi4mpUS3bt1YtWoVH330EcXFxRQVFXHdddfx0EMP0aNHD8477zxWrFjBRx99BMCee+7JUUcdxaRJk7j++uv561//SlFREUAAvwe+BnQC7gRIuVT438AvgSdTShsaehiNEtQi4sSIWBQRb0TEZfVsj4iYmt8+LyIGNUa9kiRJnxIBN90EkybVrjoWKOrShWnvvce1P/kJEyZMqN1WVFTEhAkT+POf/8zpp5/Ol770Ja644grGjBnDunXrKCkp4Re/+AUdO+auFbjxxhtZtmwZo0aNYsOGDaxfv55+/foB7AuUA88B1SmlFXVadQdwFnDbdh1KoRcT5K9meA0YBVQBs4AzU0p/q1PmZOA7wMnAEUB5SumIbe3biwkkSdIOSSk3R23q1E/WTZqUC3D5CwwKsXHjRjp06ADABRdcwAknnMDYsWPnpJRKIuIm4P6U0sM15SNib+A3KaXjtqeexuhRGwy8kVJ6M6X0EXA3MGaLMmOA/0o5zwHFEbFPI9QtSZK0ufpCGuSW61xgUIj58+dz9NFHM2TIENatW8epp54KQETcCfTcIqSNAmYC12xvPY1xMcF+5C47rVFFrtdsW2X2A97ZcmcRMRGYCNCzZ89GaJ4kSdrp1FxIUNOLVhPcatYXaNCgQTz11FOfWp9SGlfPuoeBhz9VuAEaI6jV13+4ZVRtSJncypRuJXcZKyUlJdm9yZskScqmCDjxxNwFBTVDnTfdlNu2++6NMvTZXBojqFUB+9dZ7gEs24EykiRJjaOsbLMb3taGtVYU0qBx5qjN
AvpGxAER0RE4g9w4bF0zgW/nr/48ElibUvrUsKckSVKj2TKUtbKQBo3Qo5ZS2hQR/xf4C1AE3J5SeiUiLspv/xXwALkrPt8g9ziFcwutV5Ikqa1rlCcTpJQeIBfG6q77VZ33Cbi4MeqSJEnaWfhkAkmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRhnUJEmSMsqgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRhnUJEmSMsqgJkmSlFEGNUmSpIwyqEmSJG2nqqoqhg8f3uT1GNQkSZKaUEQU7ehnDWqSJKlNuvzyyznmmGMYMmQI9913H2+//TYnnngixxxzDCNGjKC6uprx48fz9NNPAzB9+nTKysoA+MEPfsCxxx7LoEGDuPXWWwFYt24do0ePZuTIkdx444219bz22ms1vWsHRcRvI2JXgIhYHBG/BGbs6DG039EPSpIkZU5KEMGDDz7I6tWreaKign+uX8+QIUM48MADueSSSzj++OOprq6mXbut91ddeeWVdOrUiQ0bNvClL32Jc889l//4j/9g2LBhXH755fz617/mxRdfBOD73/8+P/7xjznmmGMWAa8AFwBTgX2A61NKb+/o4dijJkmS2oayMpg8GVJi/vz5PPHEEwzff39O7tePDRs28Le//Y1jjz0WoDakRUTtx1NKte9vueUWhg0bxvHHH8/y5ctZvnw5r732GoMHDwbgiCOOqC372muvMXTo0JrFZ4CD8++XFhLSwKAmSZLagpRgzRooL+fdiRN59plnOH6XXahYupSKU09l3ssv079/fyoqKgCorq4GYI899qCqqgqAOXPmALB69Wpuv/12nnjiCf7yl7/QtWtXUkr07duX2bNnAzBr1qzaqg888ECeeeaZmsWhwKL8+48LPayomx6zpqSkJNWcEEmSpM+UUq5HrbwcgH8DntpvP6JvX/bdd1+uu+46LrjgAj788EM6dOjAQw89xKJFizjzzDPp2bMn3bp1o2fPnlx11VWcfvrpVFVVccghhzB37lxmzpxJ165dOf3009m4cSMDBgxg7ty5VFRU8Oqrr3LhhRfy5JNPrgP+ApyTUlofEW+klPoUckgGNUmS1HakRGW7dpwPDAMqx41j1apVnHnmmcydO5cXXniBtWvXctFFFzFx4kTWrVvHN7/5TTZs2MChhx7Kiy++WNvrtr0iYk5KqaQxD8eLCSRJUutX0/E0efJmqz83Zw4z582DCE455ZQGXyCQFc5RkyRJrVtZGXz3u7mf8nIYPx723x/224+hCxbUXmCwPRcIZIVBTZIktV41FxFMnQrPPw+TJuXWL1kC++1H0fHHQ3Exq9es+dQFAr/97W/Zf//9671AoD6VlZWMHDmyiQ9ocw59SpKk1isCpkzJvS8vz4U1yPWonXgi9OkD55xDcUr079+fYcOGccghh7Dnnnvy+9//njvuuIPJkyfz8MMPM2DAAAA+/vhjiop2+GECjcoeNUmS1LrVCWuVwFhg76OP5r777+fvq1axdu1avvnNb7Jq1Sp23XVXrrjiCn72s5+xcOFCzj33XPr06cNtt93G448/zsKFC7ngggt47rnnGDp0KMOGDeNf//Vf2fLiyyVLljB69GiOO+44Ro8ezYoVK/JNiTc+aVY8EhG98j8vRMQdEbEgIs6OiDsj4sWIuPyzDs0eNUmS1LpVV8Mll9QuVgKPde3KLs88w+GDB/Pyyy9z2mmnccYZZ/Dyyy9z2WWX8Yc//IGBAwcyffp0evToQWVlJUuWLKF///7cfvvtlJSU8Lvf/Y7evXtz3nnnce+993LooYfW1nHppZfyb//2bxx55JHMmDGDn/70p9tq5X7A/waK803sBawkd8+167b2IYOaJElqva66CmbOhLlzobQUJk3i4EMPpfMtt0DHjgwYMIB33nmH8vJyfvWrXwHQvn398eewww7j8ccfB2Dt2rX07t0bgKFD
h/Lqq69uFtTmz5/PZZddBsCmTZvo06fe26VFnfevppQ+BN6NiKUppXcBImJ9RBSllOq9Oa5BTZIktU4pwdq1uZA2cCDceCOcfz6vfvAB6770JXbp3JkFFRUMGjSIiRMn8rWvfQ2Ajz76CICOHTuyadOm2t3VnZfWtWtX3nzzTXr37s0zzzzDmDFjNqu6f//+XH755Xz5y1+u3eedd94J0C4iPgcUAYfUbe1W3sPmgW4zBjVJktQ6bXkhQT5o9dprLy7o14/X//xnxo0bx3nnncdFF13EzTffTEqJr3zlK3zve9/jtNNOY8KECQwdOpQJEyZstuupU6dy9tlnU1RURP/+/TnllFNYvHhx7fYbbriBiy++mHXr1gFw3nnn1Wz6OfAcMBeoKvgQfTKBJElq1VKC/EPWK4HzR4zgkUceafZmNMWTCbzqU5IktV41z/es67XXPnlSQStnUJMkSa1T3Yewl5ZCdTW9Skt5ZMmS2qcRtHbOUZMkSa1TBBQX50LalCmbz1krLs4tt3LOUZMkSa1bSpuHsi2Xm0nm5qhFxB4R8XBEvJ5/3b2eMvtHxOMRsTAiXomI0kLqlCRJ2syWoawN9KTVKHSO2mXAoymlvsCj+eUtbQK+l1I6BDgSuDgi+hVYryRJUptXaFAbA9yZf38ncOqWBVJK76SUXsy/fx9YSO4xCpIkSfoMhQa1vVNK70AukAF7fVbhiOgFfBl4vsB6JUmS2rxtXvUZEY8A3evZ9MPtqSgiPg/8N/DdlNI/PqPcRGAiQM+ePbenCkmSpDZlmz1qKaWRKaUB9fzMAN6LiH0A8q/L69tHRHQgF9J+nVL64zbquzWlVJJSKvnCF76w/UckSZJ2Otdffz3z588H2NoD0lulQu+jNhMYB1yff52xZYGICOA/gYUppRsLrE+SJOlTLrusvusZW79C56hdD4yKiNeBUfllImLfiHggX+Yo4BzguIiYm/85ucB6JUnSTiqlxIUXXsiwYcMYOnQoL7zwAuPHj+fpp59u6aY1uoJ61FJKfwdG1LN+GXBy/v3TQNu5oYkkSWp+dW5iO2PGDDZ+9BFPP/00b775JmeccQb9+rXNO3/5rE9JkpRtZWWbPbtz0auvMnTJEigro3fv3qxevbpl29eEDGqSJCm7UoI1a3IPXs+HtYP++leeefRRWLOGN//nfyguLm7pVjYZg5okScqumgetl5bmwlq7dpxy330U9e/PsNmzOftb3+Lmm29u6VY2GR/KLkmSsi8laFenf6m6OnPP9MzcQ9klSZKaXEq5Yc+66sxZa8sMapIkKbtqQlp5eW74s7r6k2HQnSCsFXrDW0mSpKYTAcXFuXA2Zconc9Ygtz5jw5+NzTlqkiQp++rcR63e5QxwjpokSdo5bRnKMhbSmopBTZIkKaMMapIkSRllUJMkScoog5okSVJGGdQkSZIyyqAmSZKUUQY1SZKkjDKoSZIkZZRBTZIkKaMMapIkSRllUJMkScoog5okSVJGGdQkSZIyyqAmSZKUUQY1SZKkjDKoSZIkZZRBTZIkNZvKykpGjhzZJPueNm0aDz/8MABTp05tkjqam0FNkiS1CePHj2fUqFGAQU2SJKkgP//5zznkkEM499xza9f16dMHgBEjRrBq1Srmz59Px44def/995k1axYTJ04E4IQTTmD48OEMHjyYZ599FoCysjKmT5/OXXfdxdKlSxk+fDjXXntt8x9YI2rf0g2QJEltXEoQsdmqK664go4dO3LLLbcwffr0T31k+PDhPP7441RVVXHSSSfx5JNPsmDBAo499lgA/vjHP9KpUycWLlzIxRdfzGOPPVb72bPOOosrr7ySioqKJj2s5mBQkyRJTaesDNasgSlTcmEtJV559llWvfIKzy1eXNsbtqURI0Ywffp0Vq5cyVVXXcX06dNZuHAhd9xxB+vXr6e0tJRFixZRVFTE0qVLm/WQmpNDn5IkqWmklAtp5eUweXJu+eqr6f/Pf/LDQYM4
/fTT2WOPPaiqqgJg7ty5bNq0CYDBgwfz/PPPs2HDBgYNGsQrr7zC3//+d7p3786DDz5IUVERTz31FL/85S9JKX2q6vbt21NdXd2cR9sk7FGTJElNIyLXkwa5sFZennu///58/b776HDvvVxxxRV07tyZY445hmOOOYb27XPRpH379nTv3p2BAwcC0L17d/r27QvAkCFDuO666xg5ciRHHXVUvVV/4xvfYPTo0Zx00klMmjSpSQ+zKUV9KTQrSkpK0uzZs1u6GZIkqRApQbs6g3jV1Z+as9YWRMSclFJJY+7ToU9JktR0UsoNe9ZVMwyqbTKoSZKkplET0srLobQ015NWWrr5nDV9JueoSZKkphEBxcW5cFZz1WfNnLXi4jY5/NnYnKMmSZKa1pb3UavnvmptgXPUJElS67NlKGuDIa2pGNQkSZIyyqAmSZKUUQUFtYjYIyIejojX86+7f0bZooh4KSLuK6ROSZKknUWhPWqXAY+mlPoCj+aXt6YUWFhgfZIkSTuNQoPaGODO/Ps7gVPrKxQRPYDRwG0F1idJkrTTKDSo7Z1Segcg/7rXVsrdBHwfaP1PR5UkSWom27zhbUQ8AnSvZ9MPG1JBRHwFWJ5SmhMRwxtQfiIwEaBnz54NqUKSJKlN2mZQSymN3Nq2iHgvIvZJKb0TEfsAy+spdhRwSkScDOwCdImI6Smlb22lvluBWyF3w9uGHIQkSVJbVOjQ50xgXP79OGDGlgVSSpenlHqklHoBZwCPbS2kSZIk6ROFBrXrgVER8TowKr9MROwbEQ8U2jhJkqSdWUEPZU8p/R0YUc/6ZcDJ9ayvACoKqVOSJGln4ZMJJEmSMsqgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRhnUJEmSMsqgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRhnUJEmSMsqgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRhnUJEmSMsqgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJEmtzNSpU3f4s9OmTeMf//hHI7ZGTcmgJklSK2NQ23kY1CRJyoCUEhdeeCHDhg1j6NChvPDCCwwfPpyqqioArrnmGqZNm8Zdd93F0qVLGT58ONdeey0VFRWccMIJfP3rX2fgwIH8/ve/B2D8+PE8/fTTAEyfPp2ysjIee+wx5s6dy9ixY/nOd77TYseqhmvf0g2QJGmnlhJEMGPGDDZu3MjTTz3Fm2+9xRlnnMFuu+32qeJnnXUWV155JRUVFQBUVFSwdOlSXnrpJdavX09JSQlf//rX663quOOOY+DAgUyfPp0ePXo05VGpkRjUJElqKWVlsGYNTJnCokWLGDpkCEyeTO/iYlavXk2nTp1qi6aUtrqbL3/5y3To0IEOHTqw1157sWLFCiKiQZ9Vtjn0KUlSS0gpF9LKy2HyZA468ECeuekmKC/nzcWLKS4uZo899qgd+pwzZ07tR9u3b091dXXt8ty5c9m0aRPvv/8+7733Ht26ddvqZzt27MimTZua5RBVOHvUJElqCREwZUrufXk5p5SXcz8wbJ99+PjVV7n55pvZsGED559/PgceeCCf+9znaj/6jW98g9GjR3PSSSdx6KGHsu+++zJ27FjeeustrrnmGoqKijj//PM588wzueuuu+jWrRvFxcUAnHbaaUyYMIGhQ4dy9dVXN/9xa7tElrtDS0pK0uzZs1u6GZIkNZ2UoF2dAa7q6lyIa6CKigqmT5/Obbfd1gSN0/aIiDkppZLG3KdDn5IktZSUYPLkzddNnpxbL1FgUIuIPSLi4Yh4Pf+6+1bKFUfEHyLi1YhYGBFDCqlXkqRWryaklZdDaWmuJ620tHbOWkPD2vDhw+1Na8MK7VG7DHg0pdQXeDS/XJ9y4MGU0sHAYcDC
AuuVJKl1i4Di4lw4mzLlkzlrpaW59dsx/Km2q6A5ahGxCBieUnonIvYBKlJKB21RpgvwMtA7bWdlzlGTJLV5+fuobXVZrUYW56jtnVJ6ByD/ulc9ZXoDK4A7IuKliLgtIjrVU06SpJ3PlqHMkKY6thnUIuKRiFhQz8+YBtbRHhgE3JJS+jLwAVsfIiUiJkbE7IiYvWLFigZWIUmS1PZs8z5qKaWRW9sWEe9FxD51hj6X11OsCqhKKT2fX/4DnxHUUkq3ArdCbuhzW+2TJElqqwod+pwJjMu/HwfM2LJASuldYElE1MxdGwH8rcB6JUmS2rxCg9r1wKiIeB0YlV8mIvaNiAfqlPsO8OuImAcMBH5SYL2SJEltXkGPkEop/Z1cD9mW65cBJ9dZngs06lUQkiRJbZ1PJpAkScoog5okSVJGGdQkSZIyyqAmSZKUUQY1SZKkjDKoSZIkZZRBTZIkKaMMapIkSRllUJMkScoog5okSVJGGdQkSZIyyqAmSZKUUQY1SZKkjDKoSZIkZZRBTZIkKaMMapIkSRllUJMkScoog5okSVJGGdQkSZIyyqAmSZKUUQY1SZKkjDKoSZIkZZRBTZIkKaMMapIkSRllUJMkScoog5okqclUVlYycuTIVrdvKSsMapKkTKmurt5s+eOPP26hlkgtr31LN0CS1LatXbuWs88+m0WLFnHOOedw6KGH8uMf/5hNmzaxxx578Nvf/pZddtmFPn36cPrpp/Pss89y6aWXUl5eTpcuXfjiF7/ISSedxJVXXklEcPDBB3PLLbdsVseUKVO4++672W233Tj11FMpLS1toaOVGpdBTZLUuFKCiNrFyspKHnvsMXbZZRcOP/xwZsyYweOPPw7AD37wA373u9/x7W9/m02bNvHVr36Vn/zkJ1RUVLBs2TLuu+8+2rdvz6BBg6ioqKBr165MnjyZ+++/nwEDBtTW8etf/5rHH3+czp07f6pHTmrNDGqSpMZTVgZr1sCUKbmwlhIHd+xI5xtugLIyBgwYwLvvvssFF1zAhg0beO+99+jSpQsARUVFHHnkkbW7KikpoUOHDqxYsYLKykrGjBkDwLp16zjooIM2C2o33XQTkyZNYtOmTVx44YUMGzasOY9aajIGNUlS40gpF9LKy3PLU6bA1Vfz6rJlrFu+nF02bmTBggWUlZXx7//+7wwZMoTvf//7pJQAiAiiTk9cUVERAN26daN3797cd999fP7znwdg48aNLF26tLbsoEGDGDZsGFVVVYwZM4Y5c+Y0zzFLTcygJklqHBG5cAa5sJYPbL322osLVq/m9SFDGDduHN27d2fChAkcdNBBdO3atbZHbeu7DW688UZOOeUUUkq0a9eOKVOmbPa5c845h5UrV/Lhhx9y8cUXN9khSs0tav4nk0UlJSVp9uzZLd0MSdL2SAna1bmpQHX1ZnPWpLYqIuaklEoac5/enkOS1HhSgsmTN183eXJuvaTtZlCTJDWOmpBWXg6lpbmetNLS3LJhTdohzlGTJDWOCCguzoWzmqs+a+asFRc7/CntAOeoSZIa1xb3UfvUstRGOUdNkpR9W4YyQ5q0wwxqkiRJGWVQkyRJyqiCglpE7BERD0fE6/nX3bdSbnJEvBIRCyLiNxGxSyH1SpIk7QwK7VG7DHg0pdQXeDS/vJmI2A+YBJSklAYARcAZBdYrSZLU5hUa1MYAd+bf3wmcupVy7YFdI6I9sBuwrMB6JUmS2rxCg9reKaV3APKve21ZIKW0FPgZ8DbwDrA2pfRQgfVKkiS1edsMahHxSH5u2ZY/YxpSQX7e2hjgAGBfoFNEfOszyk+MiNkRMXvFihUNPQ5JkqQ2Z5tPJkgpjdzatoh4LyL2SSm9ExH7AMvrKTYSeCultCL/mT8CQ4HpW6nvVuBWyN3wdtuHIEmS1DYVOvQ5ExiXfz8OmFFPmbeBIyNit4gIYASwsMB6JUmS2rxCg9r1wKiIeB0YlV8mIvaNiAcAUkrPA38A
XgTm5+u8tcB6JUmS2jyf9SlJktQIfNanJEnSTsSgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRhnUJEmSMsqgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRhnUJEmSMsqgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRhnUJEmSMsqgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGqSJEkZZVCTJEnKKIOaJElSRhnUJEmSMsqgJkmSlFEGNUmSpIwyqEmSJGWUQU2SJCmjDGpAZWUlI0eO3O7PXXPNNUybNq3xGyRJkoRBTZIkKbPat3QDWkxKEFG7uHbtWs4++2wWLVrEOeecQ9euXbn//vv58MMPqaqqYurUqRx99NE8+eSTTJo0iZ49e/K5z32OHj16tOBBSJKktqygHrWIGBsRr0REdUSUfEa5EyNiUUS8ERGXFVJnoygrg8mTc2ENICUqX3mFX/XsybPPPssdd9zB8uXLef/997n33nu55557mDx5MgCXXHIJM2fOZMaMGaxdu7bljkGSJLV5hQ59LgBOA57cWoGIKAJ+AZwE9APOjIh+Bda741KCNWugvPyTsHb11Ry8fj2d16+nQ/v2DBgwgJQShx9+OAC9evWqDWX/+Mc/6NmzJxHB4MGDW+wwJElS21dQUEspLUwpLdpGscHAGymlN1NKHwF3A2MKqbcgETBlCpSW5sJau3Zwxx28uuuurLv6ajZ9/DELFiwgIpgzZw4Ab7/9Nl26dAGgc+fOVFVVATBr1qwWOwxJktT2Nccctf2AJXWWq4AjmqHerasJa+Xltat69e/PBRMn8vrrrzNu3Dh23313dtttN0aPHs2yZcuYMmUKADfccANf/epX2XfffencuXNLHYEkSdoJbDOoRcQjQPd6Nv0wpTSjAXVEPevSZ9Q3EZgI0LNnzwbsfgeklBv2zOsFzDrqqFx4y19gMG3aNAYOHMiPfvSjzT46fPhwXnrppaZplyRJUh3bDGoppe2/wdjmqoD96yz3AJZ9Rn23ArcClJSUbDXQ7bCakFZenhv+nDLlk2XYLKxJkiS1pOYY+pwF9I2IA4ClwBnAWc1Qb/0ioLj4k5BWMwwKufX5kDZ+/PiWaqEkSRIAkdKOd1pFxNeAm4EvAGuAuSmlEyJiX+C2lNLJ+XInAzcBRcDtKaVrG7L/kpKSNHv27B1u32fa4j5qn1qWJEnaDhExJ6W01duV7YiCetRSSvcA99Szfhlwcp3lB4AHCqmr0W0ZygxpkiQpY3yElCRJUkYZ1CRJkjLKoCZJkpRRBjVJkqSMMqhJkiRllEFNkiQpowxqkiRJGVXQDW+bWkSsABa3dDsaSTdgZUs3Yifnd9CyPP8ty/Pfsjz/Lau5zv+/pJS+0Jg7zHRQa0siYnZj361Y28fvoGV5/luW579lef5bVms+/w59SpIkZZRBTZIkKaMMas3n1pZugPwOWpjnv2V5/luW579ltdrz7xw1SZKkjLJHTZIkKaMMao0sIk6MiEUR8UZEXFbP9oiIqfnt8yJiUEu0s61qwPk/O3/e50XEMxFxWEu0s63a1vmvU+7wiPg4Ir7RnO1r6xpy/iNieETMjYhXIuKJ5m5jW9eAf4O6RsS9EfFy/js4tyXa2RZFxO0RsTwiFmxle6v8/WtQa0QRUQT8AjgJ6AecGRH9tih2EtA3/zMRuKVZG9mGNfD8vwUck1I6FLiaVjxvIWsaeP5ryv0U+EvztrBta8j5j4hi4JfAKSml/sDY5m5nW9bAvwMXA39LKR0GDAduiIiOzdrQtmsacOJnbG+Vv38Nao1rMPBGSunNlNJHwN3AmC3KjAH+K+U8BxRHxD7N3dA2apvnP6X0TEppdX7xOaBHM7exLWvIn3+A7wD/DSxvzsbtBBpy/s8C/phSehsgpeR30Lga8h0koHNE
BPB5YBWwqXmb2TallJ4kdz63plX+/jWoNa79gCV1lqvy67a3jHbM9p7bCcCfm7RFO5dtnv+I2A/4GvCrZmzXzqIhf/4PBHaPiIqImBMR32621u0cGvId/Bw4BFgGzAdKU0rVzdO8nV6r/P3bvqUb0MZEPeu2vKy2IWW0Yxp8biPiWHJBbViTtmjn0pDzfxPwg5TSx7kOBTWihpz/9sD/AkYAuwLPRsRzKaXXmrpxO4mGfAcnAHOB44AvAg9HxFMppX80cdvUSn//GtQaVxWwf53lHuT+17S9ZbRjGnRuI+JQ4DbgpJTS35upbTuDhpz/EuDufEjrBpwcEZtSSn9qlha2bQ3992dlSukD4IOIeBI4DDCoNY6GfAfnAten3L2x3oiIt4CDgReap4k7tVb5+9ehz8Y1C+gbEQfkJ4eeAczcosxM4Nv5q0+OBNamlN5p7oa2Uds8/xHRE/gjcI69CI1um+c/pXRASqlXSqkX8Afg/xjSGk1D/v2ZARwdEe0jYjfgCGBhM7ezLWvId/A2uR5NImJv4CDgzWZt5c6rVf7+tUetEaWUNkXE/yV3NVsRcHtK6ZWIuCi//VfAA8DJwBvAP8n970qNoIHn/0pgT+CX+V6dTa31Qb1Z08DzrybSkPOfUloYEQ8C84Bq4LaUUr23MtD2a+DfgauBaRExn9xQ3A9SSitbrNFtSET8htyVtN0iogq4CugArfv3r08mkCRJyiiHPiVJkjLKoCZJkpRRBjVJkqSMMqhJkiRllEFNkiQpowxqkiRJGWVQkyRJyiiDmiRJUkb9f4XN2TEuM67oAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 720x360 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']\n",
    "plot_embeddings(M_reduced_normalized, word2ind, words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
    "- Unlike the window-based co-occurrence embeddings, these pre-trained GloVe embeddings do not cluster the countries (\"iraq\", \"ecuador\" and \"kuwait\") together. For example, \"kuwait\" lies far from \"iraq\".\n",
    "- GloVe seems to cluster \"energy\" and \"industry\" more closely than the co-occurrence embeddings do. \"oil\" and \"petroleum\" also cluster together, which matches our expectations. \"barrels\" and \"bpd\" are still far apart, even though we would expect them to be closer, since barrels per day is a rate measured in barrels.\n",
    "- These GloVe word vectors were trained on a much larger corpus that covers many topics, whereas our co-occurrence word vectors were generated from a relatively small corpus of news articles about crude oil. As a result, the GloVe vectors contain semantic information from outside the context of oil. This may explain why the countries are not clustered here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cosine Similarity\n",
    "Now that we have word vectors, we need a way to quantify the similarity between individual words, according to these vectors. One such metric is cosine-similarity. We will be using this to find words that are \"close\" and \"far\" from one another.\n",
    "\n",
    "We can think of n-dimensional vectors as points in n-dimensional space. If we take this perspective, [L1](http://mathworld.wolfram.com/L1-Norm.html) and [L2](http://mathworld.wolfram.com/L2-Norm.html) distances help quantify the amount of space \"we must travel\" to get between two points. Another approach is to examine the angle between two vectors. From trigonometry we know that:\n",
    "\n",
    "<img src=\"imgs/inner_product.png\" width=20% style=\"float: center;\"></img>\n",
    "\n",
    "Instead of computing the actual angle, we can leave the similarity in terms of $similarity = cos(\\Theta)$. Formally, the [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) $s$ between two vectors $p$ and $q$ is defined as:\n",
    "\n",
    "$$s = \\frac{p \\cdot q}{||p|| ||q||}, \\textrm{ where } s \\in [-1, 1] $$ "
   ]
  },
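  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the definition concrete, here is a small worked example with made-up 2D vectors $p = (3, 4)$ and $q = (4, 3)$ (illustrative values only, not real word vectors):\n",
    "\n",
    "$$p \\cdot q = 3 \\cdot 4 + 4 \\cdot 3 = 24, \\qquad ||p|| = ||q|| = \\sqrt{3^2 + 4^2} = 5, \\qquad s = \\frac{24}{5 \\cdot 5} = 0.96$$\n",
    "\n",
    "The two vectors point in nearly the same direction, so $s$ is close to $1$. Scaling either vector by a positive constant leaves $s$ unchanged, since the scale cancels between numerator and denominator; replacing $q$ with $-q$ flips the sign and gives $s = -0.96$."
   ]
  },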
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.2: Words with Multiple Meanings (1.5 points) [code + written] \n",
    "Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, \"leaves\" has both \"go_away\" and \"a_structure_of_a_plant\" meaning in the top 10, and \"scoop\" has both \"handed_waffle_cone\" and \"lowdown\". You will probably need to try several polysemous or homonymic words before you find one. \n",
    "\n",
    "Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?\n",
    "\n",
    "**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('weapons', 0.7115006446838379),\n",
       " ('hand', 0.5853789448738098),\n",
       " ('hands', 0.582863986492157),\n",
       " ('weapon', 0.5786144733428955),\n",
       " ('embargo', 0.5249772667884827),\n",
       " ('arm', 0.5146462917327881),\n",
       " ('weaponry', 0.513433039188385),\n",
       " ('nuclear', 0.5115358829498291),\n",
       " ('disarmament', 0.5083263516426086),\n",
       " ('iraq', 0.49865245819091797)]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "    wv_from_bin.most_similar(\"arms\")\n",
    "    #wv_from_bin.most_similar(\"mouse\")\n",
    "    #wv_from_bin.most_similar(\"mole\")\n",
    "    #wv_from_bin.most_similar(\"tear\")\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
    "- The 10 words most similar to \"arms\" include both weapon-related words (\"weapons\", \"weapon\", \"weaponry\") and limb-related words (\"hand\", \"hands\", \"arm\"), which suggests that the embedding has captured both meanings of \"arms\": weaponry and limbs.\n",
    "- One reason why many polysemous words don't exhibit multiple meanings is that these GloVe vectors are built upon Wiki data, in which a word tends to be used mostly in one dominant sense, so that sense dominates its single vector. Another reason could be that the top-10 list is often taken up by different forms of the same word (e.g. \"arm\") or by several near-duplicates of one meaning (e.g. \"hand\"/\"hands\"), which leaves little room for words from the other meaning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.3: Synonyms & Antonyms (2 points) [code + written] \n",
    "\n",
    "When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.\n",
    "\n",
    "Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$. \n",
    "\n",
    "As an example, $w_1$=\"happy\" is closer to $w_3$=\"sad\" than to $w_2$=\"cheerful\". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.\n",
    "\n",
    "You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Synonyms love, affection have cosine distance: 0.42052197456359863\n",
      "Antonyms love, hate have cosine distance: 0.49353712797164917\n"
     ]
    }
   ],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "    w1 = \"love\" # synonym 1\n",
    "    w2 = \"affection\" # synonym 2\n",
    "    w3 = \"hate\" # antonym\n",
    "    w1_w2_dist = wv_from_bin.distance(w1, w2)\n",
    "    w1_w3_dist = wv_from_bin.distance(w1, w3)\n",
    "\n",
    "    print(\"Synonyms {}, {} have cosine distance: {}\".format(w1, w2, w1_w2_dist))\n",
    "    print(\"Antonyms {}, {} have cosine distance: {}\".format(w1, w3, w1_w3_dist))\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
    "- Word embeddings are learned from the contexts in which words appear, so similarity of context carries more weight than similarity of meaning in determining the vectors.\n",
    "- Antonyms are often used in the same contexts, while synonyms can be used in different contexts. As such, $w_1$ and $w_3$ may have appeared in more similar contexts than $w_1$ and $w_2$, which would make the vectors of $w_1$ and $w_3$ more similar to each other than those of $w_1$ and $w_2$.\n",
    "- Note that for the triple printed above (\"love\", \"affection\", \"hate\"), the antonym distance (about 0.494) is larger than the synonym distance (about 0.421), so this particular triple does not satisfy the required inequality, unlike the \"happy\"/\"cheerful\"/\"sad\" example given in the question."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.4: Analogies with Word Vectors [written] (1.5 points)\n",
    "Word vectors have been shown to *sometimes* exhibit the ability to solve analogies. \n",
    "\n",
    "As an example, for the analogy \"man : king :: woman : x\" (read: man is to king as woman is to x), what is x?\n",
    "\n",
    "In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list (while omitting the input words, which are often the most similar; see [this paper](https://www.aclweb.org/anthology/N18-2039.pdf)). The answer to the analogy will have the highest cosine similarity (largest returned numerical value)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('queen', 0.6978679299354553),\n",
      " ('princess', 0.6081743836402893),\n",
      " ('monarch', 0.5889754891395569),\n",
      " ('throne', 0.5775110125541687),\n",
      " ('prince', 0.5750998258590698),\n",
      " ('elizabeth', 0.5463595986366272),\n",
      " ('daughter', 0.5399126410484314),\n",
      " ('kingdom', 0.5318052768707275),\n",
      " ('mother', 0.5168544054031372),\n",
      " ('crown', 0.5164472460746765)]\n"
     ]
    }
   ],
   "source": [
    "# Run this cell to answer the analogy -- man : king :: woman : x\n",
    "pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let $m$, $k$, $w$, and $x$ denote the word vectors for `man`, `king`, `woman`, and the answer, respectively. Using **only** vectors $m$, $k$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, what is the expression in which we are maximizing cosine similarity with $x$?\n",
    "\n",
    "Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `king` and the answer?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
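    "The expression whose cosine similarity with $x$ we are maximizing is:\n",
    "$$k - m + w$$\n",
    "\n",
    "The analogy asks for the offset from `man` to `king` to match the offset from `woman` to the answer, i.e. $k - m \\approx x - w$. Rearranging gives $x \\approx k - m + w$, so the answer is the vocabulary word (excluding the input words) whose vector has the highest cosine similarity with $k - m + w$."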
   ]
  },
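  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following the hint above, here is one way to draw out a 2D example. The coordinates are invented purely for illustration and are not actual GloVe values: $m = (1, 0)$, $w = (1, 1)$, $k = (3, 0)$.\n",
    "\n",
    "$$k - m = (2, 0), \\qquad k - m + w = (2, 0) + (1, 1) = (3, 1)$$\n",
    "\n",
    "A word vector $x = (3, 1)$ would complete the parallelogram, since then $x - w = (2, 0) = k - m$: the step from `woman` to $x$ matches the step from `man` to `king`. With real embeddings no word lands exactly on that point, so we look for the vocabulary word whose vector has the highest cosine similarity with $k - m + w$."
   ]
  },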
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.5: Finding Analogies [code + written]  (1.5 points)\n",
    "Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.\n",
    "\n",
    "**Note**: You may have to try many analogies to find one that works!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('actress', 0.8572621941566467),\n",
      " ('actresses', 0.6734701991081238),\n",
      " ('actors', 0.6297088861465454),\n",
      " ('starring', 0.6084522604942322),\n",
      " ('starred', 0.5989463925361633),\n",
      " ('screenwriter', 0.595988929271698),\n",
      " ('dancer', 0.5881683230400085),\n",
      " ('comedian', 0.5791141390800476),\n",
      " ('singer', 0.5661861300468445),\n",
      " ('married', 0.5574131011962891)]\n"
     ]
    }
   ],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "    pprint.pprint(wv_from_bin.most_similar(positive=['woman','actor'], negative=['man']))\n",
    "    #pprint.pprint(wv_from_bin.most_similar(positive=['paris', 'italy'], negative=['rome']))\n",
    "    \n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
    "- Example 1:\n",
    "    - man:actor :: woman:actress\n",
    "- Example 2:\n",
    "    - rome:italy :: paris:france"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.6: Incorrect Analogy [code + written] (1.5 points)\n",
    "Find an example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the (incorrect) value of b according to the word vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('45,000-square', 0.4922032356262207),\n",
      " ('15,000-square', 0.4649604260921478),\n",
      " ('10,000-square', 0.45447564125061035),\n",
      " ('6,000-square', 0.44975778460502625),\n",
      " ('3,500-square', 0.4441334307193756),\n",
      " ('700-square', 0.44257500767707825),\n",
      " ('50,000-square', 0.4356396794319153),\n",
      " ('3,000-square', 0.43486514687538147),\n",
      " ('30,000-square', 0.4330596923828125),\n",
      " ('footed', 0.43236875534057617)]\n"
     ]
    }
   ],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "    pprint.pprint(wv_from_bin.most_similar(positive=[\"foot\", \"glove\"], negative=[\"hand\"]))\n",
    "    #pprint.pprint(wv_from_bin.most_similar(positive=[\"england\", \"india\"], negative=[\"english\"]))\n",
    "    #pprint.pprint(wv_from_bin.most_similar(positive=[\"america\", \"peacock\"], negative=[\"india\"]))\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
    "- Example 1:\n",
    "    - Actual: hand:glove :: foot:45,000-square\n",
    "    - Expected: hand:glove :: foot:sock\n",
    "    - Explanation: The intended analogy is \"hand : glove :: foot : sock\", but nearly all of the returned words (such as \"45,000-square\" and \"15,000-square\") relate to \"foot\" as a unit of measurement, as in \"square foot\", rather than as a body part.\n",
    "- Example 2:\n",
    "    - Actual: england:english :: india:pakistan\n",
    "    - Expected: england:english :: india:hindi\n",
    "- Example 3:\n",
    "    - Actual: india:peacock :: america:nbc\n",
    "    - Expected: india:peacock :: america:eagle "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.7: Guided Analysis of Bias in Word Vectors [written] (1 point)\n",
    "\n",
    "It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.\n",
    "\n",
    "Run the cell below to examine (a) which terms are most similar to \"woman\" and \"worker\" and most dissimilar to \"man\", and (b) which terms are most similar to \"man\" and \"worker\" and most dissimilar to \"woman\". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it reflects gender bias."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('employee', 0.6375863552093506),\n",
      " ('workers', 0.6068919897079468),\n",
      " ('nurse', 0.5837946534156799),\n",
      " ('pregnant', 0.536388635635376),\n",
      " ('mother', 0.5321309566497803),\n",
      " ('employer', 0.5127025842666626),\n",
      " ('teacher', 0.5099576711654663),\n",
      " ('child', 0.5096741914749146),\n",
      " ('homemaker', 0.5019454956054688),\n",
      " ('nurses', 0.4970572292804718)]\n",
      "\n",
      "[('workers', 0.611325740814209),\n",
      " ('employee', 0.5983108282089233),\n",
      " ('working', 0.5615329146385193),\n",
      " ('laborer', 0.5442320108413696),\n",
      " ('unemployed', 0.5368516445159912),\n",
      " ('job', 0.5278826951980591),\n",
      " ('work', 0.5223962664604187),\n",
      " ('mechanic', 0.5088937282562256),\n",
      " ('worked', 0.5054520964622498),\n",
      " ('factory', 0.4940453767776489)]\n"
     ]
    }
   ],
   "source": [
    "# Run this cell\n",
    "# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be\n",
    "# most dissimilar from.\n",
    "pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'worker'], negative=['man']))\n",
    "print()\n",
    "pprint.pprint(wv_from_bin.most_similar(positive=['man', 'worker'], negative=['woman']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
    "There seems to be some bias in the vectors generated around occupations with respect to gender.\n",
    "\n",
    "- Nurse, teacher, homemaker are among the top 10 terms most similar to \"woman\" and \"worker\" but most disimilar to \"man\".\n",
    "- Laborer, mechanic and factory are examples of terms most similar to \"man\" and \"worker\" but most disimilar to \"woman\".\n",
    "\n",
    "The vectors thus seem to have learnt some discrepency in employment roles among genders. However, some gender-neutral words like employee, workers do show up in both cases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.8: Independent Analysis of Bias in Word Vectors [code + written]  (1 point)\n",
    "\n",
    "Use the `most_similar` function to find another case where some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('professions', 0.5957457423210144),\n",
      " ('practitioner', 0.49884119629859924),\n",
      " ('teaching', 0.48292142152786255),\n",
      " ('nursing', 0.4821180999279022),\n",
      " ('vocation', 0.4788965880870819),\n",
      " ('teacher', 0.47160348296165466),\n",
      " ('practicing', 0.4693780839443207),\n",
      " ('educator', 0.46524325013160706),\n",
      " ('physicians', 0.4628995358943939),\n",
      " ('professionals', 0.4601393938064575)]\n",
      "\n",
      "[('reputation', 0.5250176191329956),\n",
      " ('professions', 0.5178037881851196),\n",
      " ('skill', 0.49046963453292847),\n",
      " ('skills', 0.49005505442619324),\n",
      " ('ethic', 0.4897659420967102),\n",
      " ('business', 0.4875851273536682),\n",
      " ('respected', 0.4859202802181244),\n",
      " ('practice', 0.4821045994758606),\n",
      " ('regarded', 0.47785723209381104),\n",
      " ('life', 0.4760662317276001)]\n"
     ]
    }
   ],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "    pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))\n",
    "    print()\n",
    "    pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
    "- The professions that are most similar to \"man\" but most dissimilar to \"woman\" draws up a range of careers such as physicians, educator/teacher etc. and also features keywords such as practitioner and professionals.\n",
    "- On the flipside, the embeddings that are most similar to \"woman\" and profession but most dissimilar to \"man\" are reputation, skills, ethic, business, etc. \n",
    "- This example thus exhibits a clear bias in gender-specific career choices, since careers like physicians, educator/teacher have more similarity to \"man\" than \"woman\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.9: Thinking About Bias [written] (2 points)\n",
    "\n",
    "Give one explanation of how bias gets into the word vectors. What is an experiment that you could do to test for or to measure this source of bias?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n",
    "\n",
    "- \"Your model is only as good as the data it is trained on\" a.k.a. \"garbage in/garbage out\". \n",
    "- Since your model solely relies on the input data to generate its embeddings, biases inherent to society can be implicitly propagated through the data the model is trained on. This raises concerns because their widespread use often tends to amplify these biases. \n",
    "- Often news articles exhibit bias related to race, religion, gender, sexual orientation, etc. Since the training objective is to maximize the probability of prediciting the next word correctly, if the context windows within the data have implicit biases, they will likely be captured by the model. For instance, while the association between the words \"female\" and \"queen\" is desired, the association between between the words \"female\" and \"receptionist\" indicates a unhealthy gender stereotype that needs to be neutralized, i.e. not be gender-related.\n",
    "- To test or measure the source of bias, you can compute a vector $g = e_{woman}-e_{man}$, where $e_{woman}$ represents the word vector corresponding to the word \"woman\", and $e_{man}$ corresponds to the word vector corresponding to the word \"man\". The resulting vector $g$ thus roughly encodes the concept of \"gender\". Now, by comparing the distance between $g$ and a list of say, professions, we can uncover unhealthy gender stereotypes.\n",
    "- Further, using an equalization algorithm proposed by [Boliukbasi et al. (2016)](https://arxiv.org/abs/1607.06520), we can debias word vectors to some extent by modifying them to reduce gender stereotypes (but not eliminate it altogether)."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "cs224n",
   "language": "python",
   "name": "cs224n"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: a1/Gensim word vector visualization.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gensim word vector visualization of various word vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Get the interactive Tools for Matplotlib\n",
    "%matplotlib notebook\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('ggplot')\n",
    "\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "from gensim.test.utils import datapath, get_tmpfile\n",
    "from gensim.models import KeyedVectors\n",
    "from gensim.scripts.glove2word2vec import glove2word2vec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For looking at word vectors, I'll use Gensim. We also use it in hw1 for word vectors. Gensim isn't really a deep learning package. It's a package for for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But its efficient and scalable, and quite widely used."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our homegrown Stanford offering is GloVe word vectors. Gensim doesn't give them first class support, but allows you to convert a file of GloVe vectors into word2vec format. You can download the GloVe vectors from [the Glove page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(I use the 100d vectors below as a mix between speed and smallness vs. quality. If you try out the 50d vectors, they basically work for similarity but clearly aren't as good for analogy problems. If you load the 300d vectors, they're even better than the 100d vectors.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "glove_file = datapath('/Users/manning/Corpora/GloVe/glove.6B.100d.txt')\n",
    "word2vec_glove_file = get_tmpfile(\"glove.6B.100d.word2vec.txt\")\n",
    "glove2word2vec(glove_file, word2vec_glove_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = KeyedVectors.load_word2vec_format(word2vec_glove_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.most_similar('obama')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.most_similar('banana')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.most_similar(negative='banana')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = model.most_similar(positive=['woman', 'king'], negative=['man'])\n",
    "print(\"{}: {:.4f}\".format(*result[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def analogy(x1, x2, y1):\n",
    "    result = model.most_similar(positive=[y1, x2], negative=[x1])\n",
    "    return result[0][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Analogy](imgs/word2vec-king-queen-composition.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "analogy('japan', 'japanese', 'australia')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "analogy('australia', 'beer', 'france')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "analogy('obama', 'clinton', 'reagan')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "analogy('tall', 'tallest', 'long')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "analogy('good', 'fantastic', 'bad')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model.doesnt_match(\"breakfast cereal dinner lunch\".split()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def display_pca_scatterplot(model, words=None, sample=0):\n",
    "    if words == None:\n",
    "        if sample > 0:\n",
    "            words = np.random.choice(list(model.vocab.keys()), sample)\n",
    "        else:\n",
    "            words = [ word for word in model.vocab ]\n",
    "        \n",
    "    word_vectors = np.array([model[w] for w in words])\n",
    "\n",
    "    twodim = PCA().fit_transform(word_vectors)[:,:2]\n",
    "    \n",
    "    plt.figure(figsize=(6,6))\n",
    "    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')\n",
    "    for word, (x,y) in zip(words, twodim):\n",
    "        plt.text(x+0.05, y+0.05, word)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_pca_scatterplot(model, \n",
    "                        ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',\n",
    "                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',\n",
    "                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',\n",
    "                         'frog', 'toad', 'monkey', 'ape', 'kangaroo', 'wombat', 'wolf',\n",
    "                         'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',\n",
    "                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',\n",
    "                         'school', 'college', 'university', 'institute'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_pca_scatterplot(model, sample=300)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: a1/README.txt
================================================
Welcome to CS224N!

We'll be using Python throughout the course. If you've got a good Python setup already, great! But make sure that it is at least Python version 3.5. If not, the easiest thing to do is to make sure you have at least 3GB free on your computer and then to head over to (https://www.anaconda.com/download/) and install the Python 3 version of Anaconda. It will work on any operating system.

After you have installed conda, close any open terminals you might have. Then open a new terminal and run the following commands:

# 1. Create an environment with dependencies specified in env.yml:
    
    conda env create -f env.yml

# 2. Activate the new environment:
    
    conda activate cs224n
    
# 3. Inside the new environment, install the IPython kernel so we can use this environment in Jupyter Notebook:
    
    python -m ipykernel install --user --name cs224n


# 4. Homework 1 (only) is a Jupyter Notebook. With the above done you should be able to get underway by typing:

    jupyter notebook exploring_word_vectors.ipynb
    
# 5. To make sure we are using the right environment, go to the toolbar of exploring_word_vectors.ipynb and click on Kernel -> Change kernel. You should see cs224n in the drop-down menu; select it.

# To deactivate an active environment, use
    
    conda deactivate
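
# Optional: a quick sanity check of the environment. This is only a minimal sketch and not part of the official setup. The helper name missing_packages is made up for illustration, and the import names are assumed from the packages listed in env.yml (scikit-learn is imported as sklearn). Save it to a file and run it with python inside the activated cs224n environment:

```python
import importlib.util
import sys

# Import names assumed from env.yml; scikit-learn is imported as "sklearn".
REQUIRED = ["numpy", "matplotlib", "sklearn", "nltk", "gensim", "ipykernel"]


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The README asks for Python 3.5 or newer.
    if sys.version_info < (3, 5):
        print("Python 3.5+ is required, found", sys.version.split()[0])
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

If any package is reported missing, re-creating the environment from env.yml (step 1) is probably the simplest fix.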


================================================
FILE: a1/env.yml
================================================
name: cs224n
channels:
  - defaults
  - anaconda
dependencies:
  - jupyter
  - matplotlib
  - numpy
  - python=3.7
  - ipykernel
  - scikit-learn
  - nltk
  - gensim
 


================================================
FILE: a1/exploring_word_vectors.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CS224N Assignment 1: Exploring Word Vectors (25 Points)\n",
    "### <font color='blue'> Due 4:30pm, Tue Jan 19 </font>\n",
    "\n",
    "Welcome to CS224N! \n",
    "\n",
    "Before you start, make sure you read the README.txt in the same directory as this notebook for important setup information. A lot of code is provided in this notebook, and we highly encourage you to read and understand it as part of the learning :)\n",
    "\n",
    "If you aren't super familiar with Python, Numpy, or Matplotlib, we recommend you check out the review session on Friday. The session will be recorded and the material will be made available on our [website](http://web.stanford.edu/class/cs224n/index.html#schedule). The CS231N Python/Numpy [tutorial](https://cs231n.github.io/python-numpy-tutorial/) is also a great resource.\n",
    "\n",
    "\n",
    "**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# All Import Statements Defined Here\n",
    "# Note: Do not add to this list.\n",
    "# ----------------\n",
    "\n",
    "import sys\n",
    "assert sys.version_info[0]==3\n",
    "assert sys.version_info[1] >= 5\n",
    "\n",
    "from gensim.models import KeyedVectors\n",
    "from gensim.test.utils import datapath\n",
    "import pprint\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['figure.figsize'] = [10, 5]\n",
    "import nltk\n",
    "nltk.download('reuters')\n",
    "from nltk.corpus import reuters\n",
    "import numpy as np\n",
    "import random\n",
    "import scipy as sp\n",
    "from sklearn.decomposition import TruncatedSVD\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "START_TOKEN = '<START>'\n",
    "END_TOKEN = '<END>'\n",
    "\n",
    "np.random.seed(0)\n",
    "random.seed(0)\n",
    "# ----------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word Vectors\n",
    "\n",
    "Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from *co-occurrence matrices*, and those derived via *GloVe*. \n",
    "\n",
    "**Note on Terminology:** The terms \"word vectors\" and \"word embeddings\" are often used interchangeably. The term \"embedding\" refers to the fact that we are encoding aspects of a word's meaning in a lower dimensional space. As [Wikipedia](https://en.wikipedia.org/wiki/Word_embedding) states, \"*conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension*\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Count-Based Word Vectors (10 points)\n",
    "\n",
    "Most word vector models start from the following idea:\n",
    "\n",
    "*You shall know a word by the company it keeps ([Firth, J. R. 1957:11](https://en.wikipedia.org/wiki/John_Rupert_Firth))*\n",
    "\n",
    "Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many \"old school\" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, *co-occurrence matrices* (for more information, see [here](http://web.stanford.edu/class/cs124/lec/vectorsemantics.video.pdf) or [here](https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Co-Occurrence\n",
    "\n",
    "A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the *context window* surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \\dots w_{i-1}$ and $w_{i+1} \\dots w_{i+n}$. We build a *co-occurrence matrix* $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window among all documents.\n",
    "\n",
    "**Example: Co-Occurrence with Fixed Window of n=1**:\n",
    "\n",
    "Document 1: \"all that glitters is not gold\"\n",
    "\n",
    "Document 2: \"all is well that ends well\"\n",
    "\n",
    "\n",
    "|     *    | `<START>` | all | that | glitters | is   | not  | gold  | well | ends | `<END>` |\n",
    "|----------|-------|-----|------|----------|------|------|-------|------|------|-----|\n",
    "| `<START>`    | 0     | 2   | 0    | 0        | 0    | 0    | 0     | 0    | 0    | 0   |\n",
    "| all      | 2     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 0    | 0   |\n",
    "| that     | 0     | 1   | 0    | 1        | 0    | 0    | 0     | 1    | 1    | 0   |\n",
    "| glitters | 0     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 0    | 0   |\n",
    "| is       | 0     | 1   | 0    | 1        | 0    | 1    | 0     | 1    | 0    | 0   |\n",
    "| not      | 0     | 0   | 0    | 0        | 1    | 0    | 1     | 0    | 0    | 0   |\n",
    "| gold     | 0     | 0   | 0    | 0        | 0    | 1    | 0     | 0    | 0    | 1   |\n",
    "| well     | 0     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 1    | 1   |\n",
    "| ends     | 0     | 0   | 1    | 0        | 0    | 0    | 0     | 1    | 0    | 0   |\n",
    "| `<END>`      | 0     | 0   | 0    | 0        | 0    | 0    | 1     | 1    | 0    | 0   |\n",
    "\n",
    "**Note:** In NLP, we often add `<START>` and `<END>` tokens to represent the beginning and end of sentences, paragraphs or documents. In thise case we imagine `<START>` and `<END>` tokens encapsulating each document, e.g., \"`<START>` All that glitters is not gold `<END>`\", and include these tokens in our co-occurrence counts.\n",
    "\n",
    "The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run *dimensionality reduction*. In particular, we will run *SVD (Singular Value Decomposition)*, which is a kind of generalized *PCA (Principal Components Analysis)* to select the top $k$ principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. We obtain a full matrix decomposition, with the singular values ordered in the diagonal $S$ matrix, and our new, shorter length-$k$ word vectors in $U_k$.\n",
    "\n",
    "![Picture of an SVD](./imgs/svd.png \"SVD\")\n",
    "\n",
    "This reduced-dimensionality co-occurrence representation preserves semantic relationships between words, e.g. *doctor* and *hospital* will be closer than *doctor* and *dog*. \n",
    "\n",
    "**Notes:** If you can barely remember what an eigenvalue is, here's [a slow, friendly introduction to SVD](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf). If you want to learn more thoroughly about PCA or SVD, feel free to check out lectures [7](https://web.stanford.edu/class/cs168/l/l7.pdf), [8](http://theory.stanford.edu/~tim/s15/l/l8.pdf), and [9](https://web.stanford.edu/class/cs168/l/l9.pdf) of CS168. These course notes provide a great high-level treatment of these general purpose algorithms. Though, for the purpose of this class, you only need to know how to extract the k-dimensional embeddings by utilizing pre-programmed implementations of these algorithms from the numpy, scipy, or sklearn python packages. In practice, it is challenging to apply full SVD to large corpora because of the memory needed to perform PCA or SVD. However, if you only want the top $k$ vector components for relatively small $k$ — known as [Truncated SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition#Truncated_SVD) — then there are reasonably scalable techniques to compute those iteratively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting Co-Occurrence Word Embeddings\n",
    "\n",
    "Here, we will be using the Reuters (business and financial news) corpus. If you haven't run the import cell at the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. These documents span 90 categories and are split into train and test. For more details, please see https://www.nltk.org/book/ch02.html. We provide a `read_corpus` function below that pulls out only articles from the \"crude\" (i.e. news articles about oil, gas, etc.) category. The function also adds `<START>` and `<END>` tokens to each of the documents, and lowercases words. You do **not** have to perform any other kind of pre-processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_corpus(category=\"crude\"):\n",
    "    \"\"\" Read files from the specified Reuter's category.\n",
    "        Params:\n",
    "            category (string): category name\n",
    "        Return:\n",
    "            list of lists, with words from each of the processed files\n",
    "    \"\"\"\n",
    "    files = reuters.fileids(category)\n",
    "    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look what these documents are like…."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "reuters_corpus = read_corpus()\n",
    "pprint.pprint(reuters_corpus[:3], compact=True, width=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.1: Implement `distinct_words` [code] (2 points)\n",
    "\n",
    "Write a method to work out the distinct words (word types) that occur in the corpus. You can do this with `for` loops, but it's more efficient to do it with Python list comprehensions. In particular, [this](https://coderwall.com/p/rcmaea/flatten-a-list-of-lists-in-one-line-in-python) may be useful to flatten a list of lists. If you're not familiar with Python list comprehensions in general, here's [more information](https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html).\n",
    "\n",
    "Your returned `corpus_words` should be sorted. You can use python's `sorted` function for this.\n",
    "\n",
    "You may find it useful to use [Python sets](https://www.w3schools.com/python/python_sets.asp) to remove duplicate words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def distinct_words(corpus):\n",
    "    \"\"\" Determine a list of distinct words for the corpus.\n",
    "        Params:\n",
    "            corpus (list of list of strings): corpus of documents\n",
    "        Return:\n",
    "            corpus_words (list of strings): sorted list of distinct words across the corpus\n",
    "            num_corpus_words (integer): number of distinct words across the corpus\n",
    "    \"\"\"\n",
    "    corpus_words = []\n",
    "    num_corpus_words = -1\n",
    "    \n",
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "\n",
    "    # ------------------\n",
    "\n",
    "    return corpus_words, num_corpus_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---------------------\n",
    "# Run this sanity check\n",
    "# Note that this not an exhaustive check for correctness.\n",
    "# ---------------------\n",
    "\n",
    "# Define toy corpus\n",
    "test_corpus = [\"{} All that glitters isn't gold {}\".format(START_TOKEN, END_TOKEN).split(\" \"), \"{} All's well that ends well {}\".format(START_TOKEN, END_TOKEN).split(\" \")]\n",
    "test_corpus_words, num_corpus_words = distinct_words(test_corpus)\n",
    "\n",
    "# Correct answers\n",
    "ans_test_corpus_words = sorted([START_TOKEN, \"All\", \"ends\", \"that\", \"gold\", \"All's\", \"glitters\", \"isn't\", \"well\", END_TOKEN])\n",
    "ans_num_corpus_words = len(ans_test_corpus_words)\n",
    "\n",
    "# Test correct number of words\n",
    "assert(num_corpus_words == ans_num_corpus_words), \"Incorrect number of distinct words. Correct: {}. Yours: {}\".format(ans_num_corpus_words, num_corpus_words)\n",
    "\n",
    "# Test correct words\n",
    "assert (test_corpus_words == ans_test_corpus_words), \"Incorrect corpus_words.\\nCorrect: {}\\nYours:   {}\".format(str(ans_test_corpus_words), str(test_corpus_words))\n",
    "\n",
    "# Print Success\n",
    "print (\"-\" * 80)\n",
    "print(\"Passed All Tests!\")\n",
    "print (\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.2: Implement `compute_co_occurrence_matrix` [code] (3 points)\n",
    "\n",
    "Write a method that constructs a co-occurrence matrix for a certain window-size $n$ (with a default of 4), considering words $n$ before and $n$ after the word in the center of the window. Here, we start to use `numpy (np)` to represent vectors, matrices, and tensors. If you're not familiar with NumPy, there's a NumPy tutorial in the second half of this cs231n [Python NumPy tutorial](http://cs231n.github.io/python-numpy-tutorial/).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_co_occurrence_matrix(corpus, window_size=4):\n",
    "    \"\"\" Compute co-occurrence matrix for the given corpus and window_size (default of 4).\n",
    "    \n",
    "        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller\n",
    "              number of co-occurring words.\n",
    "              \n",
    "              For example, if we take the document \"<START> All that glitters is not gold <END>\" with window size of 4,\n",
    "              \"All\" will co-occur with \"<START>\", \"that\", \"glitters\", \"is\", and \"not\".\n",
    "    \n",
    "        Params:\n",
    "            corpus (list of list of strings): corpus of documents\n",
    "            window_size (int): size of context window\n",
    "        Return:\n",
    "            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): \n",
    "                Co-occurence matrix of word counts. \n",
    "                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.\n",
    "            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.\n",
    "    \"\"\"\n",
    "    words, num_words = distinct_words(corpus)\n",
    "    M = None\n",
    "    word2ind = {}\n",
    "    \n",
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "\n",
    "    # ------------------\n",
    "\n",
    "    return M, word2ind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---------------------\n",
    "# Run this sanity check\n",
    "# Note that this is not an exhaustive check for correctness.\n",
    "# ---------------------\n",
    "\n",
    "# Define toy corpus and get student's co-occurrence matrix\n",
    "test_corpus = [\"{} All that glitters isn't gold {}\".format(START_TOKEN, END_TOKEN).split(\" \"), \"{} All's well that ends well {}\".format(START_TOKEN, END_TOKEN).split(\" \")]\n",
    "M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)\n",
    "\n",
    "# Correct M and word2ind\n",
    "M_test_ans = np.array( \n",
    "    [[0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,],\n",
    "     [0., 0., 1., 1., 0., 0., 0., 0., 0., 0.,],\n",
    "     [0., 1., 0., 0., 0., 0., 0., 0., 1., 0.,],\n",
    "     [0., 1., 0., 0., 0., 0., 0., 0., 0., 1.,],\n",
    "     [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,],\n",
    "     [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,],\n",
    "     [1., 0., 0., 0., 0., 0., 0., 1., 0., 0.,],\n",
    "     [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,],\n",
    "     [0., 0., 1., 0., 1., 1., 0., 0., 0., 1.,],\n",
    "     [1., 0., 0., 1., 1., 0., 0., 0., 1., 0.,]]\n",
    ")\n",
    "ans_test_corpus_words = sorted([START_TOKEN, \"All\", \"ends\", \"that\", \"gold\", \"All's\", \"glitters\", \"isn't\", \"well\", END_TOKEN])\n",
    "word2ind_ans = dict(zip(ans_test_corpus_words, range(len(ans_test_corpus_words))))\n",
    "\n",
    "# Test correct word2ind\n",
    "assert (word2ind_ans == word2ind_test), \"Your word2ind is incorrect:\\nCorrect: {}\\nYours: {}\".format(word2ind_ans, word2ind_test)\n",
    "\n",
    "# Test correct M shape\n",
    "assert (M_test.shape == M_test_ans.shape), \"M matrix has incorrect shape.\\nCorrect: {}\\nYours: {}\".format(M_test.shape, M_test_ans.shape)\n",
    "\n",
    "# Test correct M values\n",
    "for w1 in word2ind_ans.keys():\n",
    "    idx1 = word2ind_ans[w1]\n",
    "    for w2 in word2ind_ans.keys():\n",
    "        idx2 = word2ind_ans[w2]\n",
    "        student = M_test[idx1, idx2]\n",
    "        correct = M_test_ans[idx1, idx2]\n",
    "        if student != correct:\n",
    "            print(\"Correct M:\")\n",
    "            print(M_test_ans)\n",
    "            print(\"Your M: \")\n",
    "            print(M_test)\n",
    "            raise AssertionError(\"Incorrect count at index ({}, {})=({}, {}) in matrix M. Yours has {} but should have {}.\".format(idx1, idx2, w1, w2, student, correct))\n",
    "\n",
    "# Print Success\n",
    "print (\"-\" * 80)\n",
    "print(\"Passed All Tests!\")\n",
    "print (\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.3: Implement `reduce_to_k_dim` [code] (1 point)\n",
    "\n",
    "Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings. \n",
    "\n",
    "**Note:** All of numpy, scipy, and scikit-learn (`sklearn`) provide *some* implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use [sklearn.decomposition.TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def reduce_to_k_dim(M, k=2):\n",
    "    \"\"\" Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)\n",
    "        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:\n",
    "            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html\n",
    "    \n",
    "        Params:\n",
    "            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurence matrix of word counts\n",
    "            k (int): embedding size of each word after dimension reduction\n",
    "        Return:\n",
    "            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.\n",
    "                    In terms of the SVD from math class, this actually returns U * S\n",
    "    \"\"\"    \n",
    "    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`\n",
    "    M_reduced = None\n",
    "    print(\"Running Truncated SVD over %i words...\" % (M.shape[0]))\n",
    "    \n",
    "        # ------------------\n",
    "        # Write your implementation here.\n",
    "    \n",
    "    \n",
    "        # ------------------\n",
    "\n",
    "    print(\"Done.\")\n",
    "    return M_reduced"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---------------------\n",
    "# Run this sanity check\n",
    "# Note that this is not an exhaustive check for correctness \n",
    "# In fact we only check that your M_reduced has the right dimensions.\n",
    "# ---------------------\n",
    "\n",
    "# Define toy corpus and run student code\n",
    "test_corpus = [\"{} All that glitters isn't gold {}\".format(START_TOKEN, END_TOKEN).split(\" \"), \"{} All's well that ends well {}\".format(START_TOKEN, END_TOKEN).split(\" \")]\n",
    "M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)\n",
    "M_test_reduced = reduce_to_k_dim(M_test, k=2)\n",
    "\n",
    "# Test proper dimensions\n",
    "assert (M_test_reduced.shape[0] == 10), \"M_reduced has {} rows; should have {}\".format(M_test_reduced.shape[0], 10)\n",
    "assert (M_test_reduced.shape[1] == 2), \"M_reduced has {} columns; should have {}\".format(M_test_reduced.shape[1], 2)\n",
    "\n",
    "# Print Success\n",
    "print (\"-\" * 80)\n",
    "print(\"Passed All Tests!\")\n",
    "print (\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.4: Implement `plot_embeddings` [code] (1 point)\n",
    "\n",
    "Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (`plt`).\n",
    "\n",
    "For this example, you may find it useful to adapt [this code](http://web.archive.org/web/20190924160434/https://www.pythonmembers.club/2018/05/08/matplotlib-scatter-plot-annotate-set-text-at-label-each-point/). In the future, a good way to make a plot is to look at [the Matplotlib gallery](https://matplotlib.org/gallery/index.html), find a plot that looks somewhat like what you want, and adapt the code they give."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_embeddings(M_reduced, word2ind, words):\n",
    "    \"\"\" Plot in a scatterplot the embeddings of the words specified in the list \"words\".\n",
    "        NOTE: do not plot all the words listed in M_reduced / word2ind.\n",
    "        Include a label next to each point.\n",
    "        \n",
    "        Params:\n",
    "            M_reduced (numpy matrix of shape (number of unique words in the corpus , 2)): matrix of 2-dimensioal word embeddings\n",
    "            word2ind (dict): dictionary that maps word to indices for matrix M\n",
    "            words (list of strings): words whose embeddings we want to visualize\n",
    "    \"\"\"\n",
    "\n",
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---------------------\n",
    "# Run this sanity check\n",
    "# Note that this is not an exhaustive check for correctness.\n",
    "# The plot produced should look like the \"test solution plot\" depicted below. \n",
    "# ---------------------\n",
    "\n",
    "print (\"-\" * 80)\n",
    "print (\"Outputted Plot:\")\n",
    "\n",
    "M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])\n",
    "word2ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}\n",
    "words = ['test1', 'test2', 'test3', 'test4', 'test5']\n",
    "plot_embeddings(M_reduced_plot_test, word2ind_plot_test, words)\n",
    "\n",
    "print (\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=red>**Test Plot Solution**</font>\n",
    "<br>\n",
    "<img src=\"./imgs/test_plot.png\" width=40% style=\"float: left;\"> </img>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)\n",
    "\n",
    "Now we will put together all the parts you have written! We will compute the co-occurrence matrix with fixed window of 4 (the default window size), over the Reuters \"crude\" (oil) corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U\\*S, so we need to normalize the returned vectors, so that all the vectors will appear around the unit circle (therefore closeness is directional closeness). **Note**: The line of code below that does the normalizing uses the NumPy concept of *broadcasting*. If you don't know about broadcasting, check out\n",
    "[Computation on Arrays: Broadcasting by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html).\n",
    "\n",
    "Run the below cell to produce the plot. It'll probably take a few seconds to run. What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have?  **Note:** \"bpd\" stands for \"barrels per day\" and is a commonly used abbreviation in crude oil topic articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -----------------------------\n",
    "# Run This Cell to Produce Your Plot\n",
    "# ------------------------------\n",
    "reuters_corpus = read_corpus()\n",
    "M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)\n",
    "M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)\n",
    "\n",
    "# Rescale (normalize) the rows to make them each of unit-length\n",
    "M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)\n",
    "M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting\n",
    "\n",
    "words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']\n",
    "\n",
    "plot_embeddings(M_normalized, word2ind_co_occurrence, words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Prediction-Based Word Vectors (15 points)\n",
    "\n",
    "As discussed in class, more recently prediction-based word vectors have demonstrated better performance, such as word2vec and GloVe (which also utilizes the benefit of counts). Here, we shall explore the embeddings produced by GloVe. Please revisit the class notes and lecture slides for more details on the word2vec and GloVe algorithms. If you're feeling adventurous, challenge yourself and try reading [GloVe's original paper](https://nlp.stanford.edu/pubs/glove.pdf).\n",
    "\n",
    "Then run the following cells to load the GloVe vectors into memory. **Note**: If this is your first time to run these cells, i.e. download the embedding model, it will take a couple minutes to run. If you've run these cells before, rerunning them will load the model without redownloading it, which will take about 1 to 2 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_embedding_model():\n",
    "    \"\"\" Load GloVe Vectors\n",
    "        Return:\n",
    "            wv_from_bin: All 400000 embeddings, each lengh 200\n",
    "    \"\"\"\n",
    "    import gensim.downloader as api\n",
    "    wv_from_bin = api.load(\"glove-wiki-gigaword-200\")\n",
    "    print(\"Loaded vocab size %i\" % len(wv_from_bin.vocab.keys()))\n",
    "    return wv_from_bin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -----------------------------------\n",
    "# Run Cell to Load Word Vectors\n",
    "# Note: This will take a couple minutes\n",
    "# -----------------------------------\n",
    "wv_from_bin = load_embedding_model()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Note: If you are receiving a \"reset by peer\" error, rerun the cell to restart the download. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reducing dimensionality of Word Embeddings\n",
    "Let's directly compare the GloVe embeddings to those of the co-occurrence matrix. In order to avoid running out of memory, we will work with a sample of 10000 GloVe vectors instead.\n",
    "Run the following cells to:\n",
    "\n",
    "1. Put 10000 Glove vectors into a matrix M\n",
    "2. Run `reduce_to_k_dim` (your Truncated SVD function) to reduce the vectors from 200-dimensional to 2-dimensional."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']):\n",
    "    \"\"\" Put the GloVe vectors into a matrix M.\n",
    "        Param:\n",
    "            wv_from_bin: KeyedVectors object; the 400000 GloVe vectors loaded from file\n",
    "        Return:\n",
    "            M: numpy matrix shape (num words, 200) containing the vectors\n",
    "            word2ind: dictionary mapping each word to its row number in M\n",
    "    \"\"\"\n",
    "    import random\n",
    "    words = list(wv_from_bin.vocab.keys())\n",
    "    print(\"Shuffling words ...\")\n",
    "    random.seed(224)\n",
    "    random.shuffle(words)\n",
    "    words = words[:10000]\n",
    "    print(\"Putting %i words into word2ind and matrix M...\" % len(words))\n",
    "    word2ind = {}\n",
    "    M = []\n",
    "    curInd = 0\n",
    "    for w in words:\n",
    "        try:\n",
    "            M.append(wv_from_bin.word_vec(w))\n",
    "            word2ind[w] = curInd\n",
    "            curInd += 1\n",
    "        except KeyError:\n",
    "            continue\n",
    "    for w in required_words:\n",
    "        if w in words:\n",
    "            continue\n",
    "        try:\n",
    "            M.append(wv_from_bin.word_vec(w))\n",
    "            word2ind[w] = curInd\n",
    "            curInd += 1\n",
    "        except KeyError:\n",
    "            continue\n",
    "    M = np.stack(M)\n",
    "    print(\"Done.\")\n",
    "    return M, word2ind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -----------------------------------------------------------------\n",
    "# Run Cell to Reduce 200-Dimensional Word Embeddings to k Dimensions\n",
    "# Note: This should be quick to run\n",
    "# -----------------------------------------------------------------\n",
    "M, word2ind = get_matrix_of_vectors(wv_from_bin)\n",
    "M_reduced = reduce_to_k_dim(M, k=2)\n",
    "\n",
    "# Rescale (normalize) the rows to make them each of unit-length\n",
    "M_lengths = np.linalg.norm(M_reduced, axis=1)\n",
    "M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis] # broadcasting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note: If you are receiving out of memory issues on your local machine, try closing other applications to free more memory on your device. You may want to try restarting your machine so that you can free up extra memory. Then immediately run the jupyter notebook and see if you can load the word vectors properly. If you still have problems with loading the embeddings onto your local machine after this, please go to office hours or contact course staff.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.1: GloVe Plot Analysis [written] (3 points)\n",
    "\n",
    "Run the cell below to plot the 2D GloVe embeddings for `['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']`.\n",
    "\n",
    "What clusters together in 2-dimensional embedding space? What doesn't cluster together that you think should have? How is the plot different from the one generated earlier from the co-occurrence matrix? What is a possible cause for the difference?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']\n",
    "plot_embeddings(M_reduced_normalized, word2ind, words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cosine Similarity\n",
    "Now that we have word vectors, we need a way to quantify the similarity between individual words, according to these vectors. One such metric is cosine-similarity. We will be using this to find words that are \"close\" and \"far\" from one another.\n",
    "\n",
    "We can think of n-dimensional vectors as points in n-dimensional space. If we take this perspective [L1](http://mathworld.wolfram.com/L1-Norm.html) and [L2](http://mathworld.wolfram.com/L2-Norm.html) Distances help quantify the amount of space \"we must travel\" to get between these two points. Another approach is to examine the angle between two vectors. From trigonometry we know that:\n",
    "\n",
    "<img src=\"./imgs/inner_product.png\" width=20% style=\"float: center;\"></img>\n",
    "\n",
    "Instead of computing the actual angle, we can leave the similarity in terms of $similarity = cos(\\Theta)$. Formally the [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) $s$ between two vectors $p$ and $q$ is defined as:\n",
    "\n",
    "$$s = \\frac{p \\cdot q}{||p|| ||q||}, \\textrm{ where } s \\in [-1, 1] $$ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.2: Words with Multiple Meanings (1.5 points) [code + written] \n",
    "Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms ). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, \"leaves\" has both \"go_away\" and \"a_structure_of_a_plant\" meaning in the top 10, and \"scoop\" has both \"handed_waffle_cone\" and \"lowdown\". You will probably need to try several polysemous or homonymic words before you find one. \n",
    "\n",
    "Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?\n",
    "\n",
    "**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.3: Synonyms & Antonyms (2 points) [code + written] \n",
    "\n",
    "When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.\n",
    "\n",
    "Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$. \n",
    "\n",
    "As an example, $w_1$=\"happy\" is closer to $w_3$=\"sad\" than to $w_2$=\"cheerful\". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.\n",
    "\n",
    "You should use the the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.4: Analogies with Word Vectors [written] (1.5 points)\n",
    "Word vectors have been shown to *sometimes* exhibit the ability to solve analogies. \n",
    "\n",
    "As an example, for the analogy \"man : king :: woman : x\" (read: man is to king as woman is to x), what is x?\n",
    "\n",
    "In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list (while omitting the input words, which are often the most similar; see [this paper](https://www.aclweb.org/anthology/N18-2039.pdf)). The answer to the analogy will have the highest cosine similarity (largest returned numerical value)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this cell to answer the analogy -- man : king :: woman : x\n",
    "pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let $m$, $k$, $w$, and $x$ denote the word vectors for `man`, `king`, `woman`, and the answer, respectively. Using **only** vectors $m$, $k$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, what is the expression in which we are maximizing cosine similarity with $x$?\n",
    "\n",
    "Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `king` and the answer?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.5: Finding Analogies [code + written]  (1.5 points)\n",
    "Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.\n",
    "\n",
    "**Note**: You may have to try many analogies to find one that works!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.6: Incorrect Analogy [code + written] (1.5 points)\n",
    "Find an example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the (incorrect) value of b according to the word vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.7: Guided Analysis of Bias in Word Vectors [written] (1 point)\n",
    "\n",
    "It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.\n",
    "\n",
    "Run the cell below, to examine (a) which terms are most similar to \"woman\" and \"worker\" and most dissimilar to \"man\", and (b) which terms are most similar to \"man\" and \"worker\" and most dissimilar to \"woman\". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it is reflecting gender bias."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this cell\n",
    "# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be\n",
    "# most dissimilar from.\n",
    "pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'worker'], negative=['man']))\n",
    "print()\n",
    "pprint.pprint(wv_from_bin.most_similar(positive=['man', 'worker'], negative=['woman']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.8: Independent Analysis of Bias in Word Vectors [code + written]  (1 point)\n",
    "\n",
    "Use the `most_similar` function to find another case where some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "    # ------------------\n",
    "    # Write your implementation here.\n",
    "\n",
    "\n",
    "    # ------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2.9: Thinking About Bias [written] (2 points)\n",
    "\n",
    "Give one explanation of how bias gets into the word vectors. What is an experiment that you could do to test for or to measure this source of bias?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <font color=\"red\">Write your answer here.</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <font color=\"blue\"> Submission Instructions</font>\n",
    "\n",
    "1. Click the Save button at the top of the Jupyter Notebook.\n",
    "2. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of all cells). \n",
    "2. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.\n",
    "3. Once you've rerun everything, select File -> Download as -> PDF via LaTeX (If you have trouble using \"PDF via LaTex\", you can also save the webpage as pdf. <font color='blue'> Make sure all your solutions especially the coding parts are displayed in the pdf</font>, it's okay if the provided codes get cut off because lines are not wrapped in code cells).\n",
    "4. Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only thing your graders will see!\n",
    "5. Submit your PDF on Gradescope."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: a1/exploring_word_vectors_solved.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CS224N Assignment 1: Exploring Word Vectors (25 Points)\n",
    "### <font color='blue'> Due 4:30pm, Tue Jan 19 </font>\n",
    "\n",
    "Welcome to CS224N! \n",
    "\n",
    "Before you start, make sure you read the README.txt in the same directory as this notebook for important setup information. A lot of code is provided in this notebook, and we highly encourage you to read and understand it as part of the learning :)\n",
    "\n",
    "If you aren't super familiar with Python, Numpy, or Matplotlib, we recommend you check out the review session on Friday. The session will be recorded and the material will be made available on our [website](http://web.stanford.edu/class/cs224n/index.html#schedule). The CS231N Python/Numpy [tutorial](https://cs231n.github.io/python-numpy-tutorial/) is also a great resource.\n",
    "\n",
    "\n",
    "**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package reuters to /Users/Aman/nltk_data...\n",
      "[nltk_data]   Package reuters is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "# All Import Statements Defined Here\n",
    "# Note: Do not add to this list.\n",
    "# ----------------\n",
    "\n",
    "import sys\n",
    "assert sys.version_info[0]==3\n",
    "assert sys.version_info[1] >= 5\n",
    "\n",
    "from gensim.models import KeyedVectors\n",
    "from gensim.test.utils import datapath\n",
    "import pprint\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['figure.figsize'] = [10, 5]\n",
    "import nltk\n",
    "nltk.download('reuters')\n",
    "from nltk.corpus import reuters\n",
    "import numpy as np\n",
    "import random\n",
    "import scipy as sp\n",
    "from sklearn.decomposition import TruncatedSVD\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "START_TOKEN = '<START>'\n",
    "END_TOKEN = '<END>'\n",
    "\n",
    "np.random.seed(0)\n",
    "random.seed(0)\n",
    "# ----------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word Vectors\n",
    "\n",
    "Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from *co-occurrence matrices*, and those derived via *GloVe*. \n",
    "\n",
    "**Note on Terminology:** The terms \"word vectors\" and \"word embeddings\" are often used interchangeably. The term \"embedding\" refers to the fact that we are encoding aspects of a word's meaning in a lower dimensional space. As [Wikipedia](https://en.wikipedia.org/wiki/Word_embedding) states, \"*conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension*\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Count-Based Word Vectors (10 points)\n",
    "\n",
    "Most word vector models start from the following idea:\n",
    "\n",
    "*You shall know a word by the company it keeps ([Firth, J. R. 1957:11](https://en.wikipedia.org/wiki/John_Rupert_Firth))*\n",
    "\n",
    "Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many \"old school\" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, *co-occurrence matrices* (for more information, see [here](http://web.stanford.edu/class/cs124/lec/vectorsemantics.video.pdf) or [here](https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Co-Occurrence\n",
    "\n",
    "A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the *context window* surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \\dots w_{i-1}$ and $w_{i+1} \\dots w_{i+n}$. We build a *co-occurrence matrix* $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window among all documents.\n",
    "\n",
    "**Example: Co-Occurrence with Fixed Window of n=1**:\n",
    "\n",
    "Document 1: \"all that glitters is not gold\"\n",
    "\n",
    "Document 2: \"all is well that ends well\"\n",
    "\n",
    "\n",
    "|     *    | `<START>` | all | that | glitters | is   | not  | gold  | well | ends | `<END>` |\n",
    "|----------|-------|-----|------|----------|------|------|-------|------|------|-----|\n",
    "| `<START>`    | 0     | 2   | 0    | 0        | 0    | 0    | 0     | 0    | 0    | 0   |\n",
    "| all      | 2     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 0    | 0   |\n",
    "| that     | 0     | 1   | 0    | 1        | 0    | 0    | 0     | 1    | 1    | 0   |\n",
    "| glitters | 0     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 0    | 0   |\n",
    "| is       | 0     | 1   | 0    | 1        | 0    | 1    | 0     | 1    | 0    | 0   |\n",
    "| not      | 0     | 0   | 0    | 0        | 1    | 0    | 1     | 0    | 0    | 0   |\n",
    "| gold     | 0     | 0   | 0    | 0        | 0    | 1    | 0     | 0    | 0    | 1   |\n",
    "| well     | 0     | 0   | 1    | 0        | 1    | 0    | 0     | 0    | 1    | 1   |\n",
    "| ends     | 0     | 0   | 1    | 0        | 0    | 0    | 0     | 1    | 0    | 0   |\n",
    "| `<END>`      | 0     | 0   | 0    | 0        | 0    | 0    | 1     | 1    | 0    | 0   |\n",
    "\n",
    "**Note:** In NLP, we often add `<START>` and `<END>` tokens to represent the beginning and end of sentences, paragraphs or documents. In thise case we imagine `<START>` and `<END>` tokens encapsulating each document, e.g., \"`<START>` All that glitters is not gold `<END>`\", and include these tokens in our co-occurrence counts.\n",
    "\n",
    "The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run *dimensionality reduction*. In particular, we will run *SVD (Singular Value Decomposition)*, which is a kind of generalized *PCA (Principal Components Analysis)* to select the top $k$ principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. We obtain a full matrix decomposition, with the singular values ordered in the diagonal $S$ matrix, and our new, shorter length-$k$ word vectors in $U_k$.\n",
    "\n",
    "![Picture of an SVD](imgs/svd.png \"SVD\")\n",
    "\n",
    "This reduced-dimensionality co-occurrence representation preserves semantic relationships between words, e.g. *doctor* and *hospital* will be closer than *doctor* and *dog*. \n",
    "\n",
    "**Notes:** If you can barely remember what an eigenvalue is, here's [a slow, friendly introduction to SVD](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf). If you want to learn more thoroughly about PCA or SVD, feel free to check out lectures [7](https://web.stanford.edu/class/cs168/l/l7.pdf), [8](http://theory.stanford.edu/~tim/s15/l/l8.pdf), and [9](https://web.stanford.edu/class/cs168/l/l9.pdf) of CS168. These course notes provide a great high-level treatment of these general purpose algorithms. Though, for the purpose of this class, you only need to know how to extract the k-dimensional embeddings by utilizing pre-programmed implementations of these algorithms from the numpy, scipy, or sklearn python packages. In practice, it is challenging to apply full SVD to large corpora because of the memory needed to perform PCA or SVD. However, if you only want the top $k$ vector components for relatively small $k$ — known as [Truncated SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition#Truncated_SVD) — then there are reasonably scalable techniques to compute those iteratively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting Co-Occurrence Word Embeddings\n",
    "\n",
    "Here, we will be using the Reuters (business and financial news) corpus. If you haven't run the import cell at the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. These documents span 90 categories and are split into train and test. For more details, please see https://www.nltk.org/book/ch02.html. We provide a `read_corpus` function below that pulls out only articles from the \"crude\" (i.e. news articles about oil, gas, etc.) category. The function also adds `<START>` and `<END>` tokens to each of the documents, and lowercases words. You do **not** have to perform any other kind of pre-processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_corpus(category=\"crude\"):\n",
    "    \"\"\" Read files from the specified Reuter's category.\n",
    "        Params:\n",
    "            category (string): category name\n",
    "        Return:\n",
    "            list of lists, with words from each of the processed files\n",
    "    \"\"\"\n",
    "    files = reuters.fileids(category)\n",
    "    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look what these documents are like…."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['<START>', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',\n",
      "  'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',\n",
      "  'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',\n",
      "  'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',\n",
      "  'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',\n",
      "  'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',\n",
      "  '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',\n",
      "  'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',\n",
      "  'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',\n",
      "  'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',\n",
      "  'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',\n",
      "  'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',\n",
      "  'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also', 'review', 'the', 'breakdown',\n",
      "  'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',\n",
      "  'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', \"'\", 's',\n",
      "  'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',\n",
      "  'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',\n",
      "  'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',\n",
      "  'they', 'noted', '.', '<END>'],\n",
      " ['<START>', 'energy', '/', 'u', '.', 's', '.', 'petrochemical', 'industry', 'cheap', 'oil',\n",
      "  'feedstocks', ',', 'the', 'weakened', 'u', '.', 's', '.', 'dollar', 'and', 'a', 'plant',\n",
      "  'utilization', 'rate', 'approaching', '90', 'pct', 'will', 'propel', 'the', 'streamlined', 'u',\n",
      "  '.', 's', '.', 'petrochemical', 'industry', 'to', 'record', 'profits', 'this', 'year', ',',\n",
      "  'with', 'growth', 'expected', 'through', 'at', 'least', '1990', ',', 'major', 'company',\n",
      "  'executives', 'predicted', '.', 'this', 'bullish', 'outlook', 'for', 'chemical', 'manufacturing',\n",
      "  'and', 'an', 'industrywide', 'move', 'to', 'shed', 'unrelated', 'businesses', 'has', 'prompted',\n",
      "  'gaf', 'corp', '&', 'lt', ';', 'gaf', '>,', 'privately', '-', 'held', 'cain', 'chemical', 'inc',\n",
      "  ',', 'and', 'other', 'firms', 'to', 'aggressively', 'seek', 'acquisitions', 'of', 'petrochemical',\n",
      "  'plants', '.', 'oil', 'companies', 'such', 'as', 'ashland', 'oil', 'inc', '&', 'lt', ';', 'ash',\n",
      "  '>,', 'the', 'kentucky', '-', 'based', 'oil', 'refiner', 'and', 'marketer', ',', 'are', 'also',\n",
      "  'shopping', 'for', 'money', '-', 'making', 'petrochemical', 'businesses', 'to', 'buy', '.', '\"',\n",
      "  'i', 'see', 'us', 'poised', 'at', 'the', 'threshold', 'of', 'a', 'golden', 'period', ',\"', 'said',\n",
      "  'paul', 'oreffice', ',', 'chairman', 'of', 'giant', 'dow', 'chemical', 'co', '&', 'lt', ';',\n",
      "  'dow', '>,', 'adding', ',', '\"', 'there', \"'\", 's', 'no', 'major', 'plant', 'capacity', 'being',\n",
      "  'added', 'around', 'the', 'world', 'now', '.', 'the', 'whole', 'game', 'is', 'bringing', 'out',\n",
      "  'new', 'products', 'and', 'improving', 'the', 'old', 'ones', '.\"', 'analysts', 'say', 'the',\n",
      "  'chemical', 'industry', \"'\", 's', 'biggest', 'customers', ',', 'automobile', 'manufacturers',\n",
      "  'and', 'home', 'builders', 'that', 'use', 'a', 'lot', 'of', 'paints', 'and', 'plastics', ',',\n",
      "  'are', 'expected', 'to', 'buy', 'quantities', 'this', 'year', '.', 'u', '.', 's', '.',\n",
      "  'petrochemical', 'plants', 'are', 'currently', 'operating', 'at', 'about', '90', 'pct',\n",
      "  'capacity', ',', 'reflecting', 'tighter', 'supply', 'that', 'could', 'hike', 'product', 'prices',\n",
      "  'by', '30', 'to', '40', 'pct', 'this', 'year', ',', 'said', 'john', 'dosher', ',', 'managing',\n",
      "  'director', 'of', 'pace', 'consultants', 'inc', 'of', 'houston', '.', 'demand', 'for', 'some',\n",
      "  'products', 'such', 'as', 'styrene', 'could', 'push', 'profit', 'margins', 'up', 'by', 'as',\n",
      "  'much', 'as', '300', 'pct', ',', 'he', 'said', '.', 'oreffice', ',', 'speaking', 'at', 'a',\n",
      "  'meeting', 'of', 'chemical', 'engineers', 'in', 'houston', ',', 'said', 'dow', 'would', 'easily',\n",
      "  'top', 'the', '741', 'mln', 'dlrs', 'it', 'earned', 'last', 'year', 'and', 'predicted', 'it',\n",
      "  'would', 'have', 'the', 'best', 'year', 'in', 'its', 'history', '.', 'in', '1985', ',', 'when',\n",
      "  'oil', 'prices', 'were', 'still', 'above', '25', 'dlrs', 'a', 'barrel', 'and', 'chemical',\n",
      "  'exports', 'were', 'adversely', 'affected', 'by', 'the', 'strong', 'u', '.', 's', '.', 'dollar',\n",
      "  ',', 'dow', 'had', 'profits', 'of', '58', 'mln', 'dlrs', '.', '\"', 'i', 'believe', 'the',\n",
      "  'entire', 'chemical', 'industry', 'is', 'headed', 'for', 'a', 'record', 'year', 'or', 'close',\n",
      "  'to', 'it', ',\"', 'oreffice', 'said', '.', 'gaf', 'chairman', 'samuel', 'heyman', 'estimated',\n",
      "  'that', 'the', 'u',

SYMBOL INDEX (197 symbols across 24 files)

FILE: a2/sgd.py
  function load_saved_params (line 12) | def load_saved_params():
  function save_params (line 34) | def save_params(iter, params):
  function sgd (line 41) | def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
  function sanity_check (line 114) | def sanity_check():

FILE: a2/utils/gradcheck.py
  function gradcheck_naive (line 8) | def gradcheck_naive(f, x, gradientText):
  function grad_tests_softmax (line 60) | def grad_tests_softmax(skipgram, dummy_tokens, dummy_vectors, dataset):
  function grad_tests_negsamp (line 137) | def grad_tests_negsamp(skipgram, dummy_tokens, dummy_vectors, dataset, n...

FILE: a2/utils/treebank.py
  class StanfordSentiment (line 9) | class StanfordSentiment:
    method __init__ (line 10) | def __init__(self, path=None, tablesize = 1000000):
    method tokens (line 17) | def tokens(self):
    method sentences (line 49) | def sentences(self):
    method numSentences (line 71) | def numSentences(self):
    method allSentences (line 78) | def allSentences(self):
    method getRandomContext (line 95) | def getRandomContext(self, C=5):
    method sent_labels (line 113) | def sent_labels(self):
    method dataset_split (line 150) | def dataset_split(self):
    method getRandomTrainSentence (line 168) | def getRandomTrainSentence(self):
    method categorify (line 173) | def categorify(self, label):
    method getDevSentences (line 185) | def getDevSentences(self):
    method getTestSentences (line 188) | def getTestSentences(self):
    method getTrainSentences (line 191) | def getTrainSentences(self):
    method getSplitSentences (line 194) | def getSplitSentences(self, split=0):
    method sampleTable (line 198) | def sampleTable(self):
    method rejectProb (line 230) | def rejectProb(self):
    method sampleTokenIdx (line 247) | def sampleTokenIdx(self):

FILE: a2/utils/utils.py
  function normalizeRows (line 5) | def normalizeRows(x):
  function softmax (line 15) | def softmax(x):

FILE: a2/word2vec.py
  function sigmoid (line 11) | def sigmoid(x):
  function naiveSoftmaxLossAndGradient (line 27) | def naiveSoftmaxLossAndGradient(
  function getNegativeSamples (line 99) | def getNegativeSamples(outsideWordIdx, dataset, K):
  function negSamplingLossAndGradient (line 111) | def negSamplingLossAndGradient(
  function skipgram (line 163) | def skipgram(currentCenterWord, windowSize, outsideWords, word2Ind,
  function word2vec_sgd_wrapper (line 230) | def word2vec_sgd_wrapper(word2vecModel, word2Ind, wordVectors, dataset,
  function test_sigmoid (line 253) | def test_sigmoid():
  function getDummyObjects (line 261) | def getDummyObjects():
  function test_naiveSoftmaxLossAndGradient (line 283) | def test_naiveSoftmaxLossAndGradient():
  function test_negSamplingLossAndGradient (line 299) | def test_negSamplingLossAndGradient():
  function test_skipgram (line 315) | def test_skipgram():
  function test_word2vec (line 331) | def test_word2vec():

FILE: a3/parser_model.py
  class ParserModel (line 16) | class ParserModel(nn.Module):
    method __init__ (line 33) | def __init__(self, embeddings, n_features=36,
    method embedding_lookup (line 84) | def embedding_lookup(self, w):
    method forward (line 122) | def forward(self, w):
  function check_embedding (line 172) | def check_embedding():
  function check_forward (line 178) | def check_forward():

FILE: a3/parser_transitions.py
  class PartialParse (line 12) | class PartialParse(object):
    method __init__ (line 13) | def __init__(self, sentence):
    method parse_step (line 43) | def parse_step(self, transition):
    method parse (line 72) | def parse(self, transitions):
  function minibatch_parse (line 86) | def minibatch_parse(sentences, model, batch_size):
  function test_step (line 149) | def test_step(name, transition, stack, buf, deps,
  function test_parse_step (line 166) | def test_parse_step():
  function test_parse (line 178) | def test_parse():
  class DummyModel (line 193) | class DummyModel(object):
    method __init__ (line 196) | def __init__(self, mode = "unidirectional"):
    method predict (line 199) | def predict(self, partial_parses):
    method unidirectional_predict (line 207) | def unidirectional_predict(self, partial_parses):
    method interleave_predict (line 214) | def interleave_predict(self, partial_parses):
  function test_dependencies (line 220) | def test_dependencies(name, deps, ex_deps):
  function test_minibatch_parse (line 227) | def test_minibatch_parse():

FILE: a3/run.py
  function train (line 30) | def train(parser, train_data, dev_data, output_path, batch_size=1024, n_...
  function train_for_epoch (line 71) | def train_for_epoch(parser, train_data, dev_data, optimizer, loss_func, ...

FILE: a3/utils/general_utils.py
  function get_minibatches (line 14) | def get_minibatches(data, minibatch_size, shuffle=True):
  function _minibatch (line 52) | def _minibatch(data, minibatch_idx):
  function test_all_close (line 56) | def test_all_close(name, actual, expected):

FILE: a3/utils/parser_utils.py
  class Config (line 27) | class Config(object):
  class Parser (line 42) | class Parser(object):
    method __init__ (line 45) | def __init__(self, dataset):
    method vectorize (line 97) | def vectorize(self, examples):
    method extract_features (line 111) | def extract_features(self, stack, buf, arcs, ex):
    method get_oracle (line 171) | def get_oracle(self, stack, buf, ex):
    method create_instances (line 199) | def create_instances(self, examples):
    method legal_labels (line 233) | def legal_labels(self, stack, buf):
    method parse (line 239) | def parse(self, dataset, eval_batch_size=5000):
  class ModelWrapper (line 269) | class ModelWrapper(object):
    method __init__ (line 270) | def __init__(self, parser, dataset, sentence_id_to_idx):
    method predict (line 275) | def predict(self, partial_parses):
  function read_conll (line 290) | def read_conll(in_file, lowercase=False, max_example=None):
  function build_dict (line 312) | def build_dict(keys, n_max=None, offset=0):
  function punct (line 322) | def punct(language, pos):
  function minibatches (line 342) | def minibatches(data, batch_size):
  function load_and_preprocess_data (line 350) | def load_and_preprocess_data(reduced=True):
  class AverageMeter (line 403) | class AverageMeter(object):
    method __init__ (line 405) | def __init__(self):
    method reset (line 408) | def reset(self):
    method update (line 414) | def update(self, val, n=1):

FILE: a4/model_embeddings.py
  class ModelEmbeddings (line 15) | class ModelEmbeddings(nn.Module):
    method __init__ (line 19) | def __init__(self, embed_size, vocab):

FILE: a4/nmt_model.py
  class NMT (line 24) | class NMT(nn.Module):
    method __init__ (line 30) | def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2):
    method forward (line 92) | def forward(self, source: List[List[str]], target: List[List[str]]) ->...
    method encode (line 131) | def encode(self, source_padded: torch.Tensor, source_lengths: List[int...
    method decode (line 191) | def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
    method step (line 270) | def step(self, Ybar_t: torch.Tensor,
    method generate_sent_masks (line 371) | def generate_sent_masks(self, enc_hiddens: torch.Tensor, source_length...
    method beam_search (line 387) | def beam_search(self, src_sent: List[str], beam_size: int=5, max_decod...
    method device (line 479) | def device(self) -> torch.device:
    method load (line 485) | def load(model_path: str):
    method save (line 496) | def save(self, path: str):

FILE: a4/run.py
  function evaluate_ppl (line 64) | def evaluate_ppl(model, dev_data, batch_size=32):
  function compute_corpus_level_bleu_score (line 94) | def compute_corpus_level_bleu_score(references: List[List[str]], hypothe...
  function train (line 114) | def train(args: Dict):
  function decode (line 277) | def decode(args: Dict[str, str]):
  function beam_search (line 313) | def beam_search(model: NMT, test_data_src: List[List[str]], beam_size: i...
  function main (line 336) | def main():

FILE: a4/sanity_check.py
  function reinitialize_layers (line 44) | def reinitialize_layers(model):
  function generate_outputs (line 60) | def generate_outputs(model, source, target, vocab):
  function question_1d_sanity_check (line 104) | def question_1d_sanity_check(model, src_sents, tgt_sents, vocab):
  function question_1e_sanity_check (line 137) | def question_1e_sanity_check(model, src_sents, tgt_sents, vocab):
  function question_1f_sanity_check (line 175) | def question_1f_sanity_check(model, src_sents, tgt_sents, vocab):
  function sanity_read_corpus (line 214) | def sanity_read_corpus(file_path, source):
  function main (line 231) | def main():

FILE: a4/utils.py
  function pad_sents (line 24) | def pad_sents(sents, pad_token):
  function read_corpus (line 46) | def read_corpus(file_path, source, vocab_size=2500):
  function autograder_read_corpus (line 69) | def autograder_read_corpus(file_path, source):
  function batch_iter (line 86) | def batch_iter(data, batch_size, shuffle=False):

FILE: a4/vocab.py
  class VocabEntry (line 32) | class VocabEntry(object):
    method __init__ (line 36) | def __init__(self, word2id=None):
    method __getitem__ (line 51) | def __getitem__(self, word):
    method __contains__ (line 59) | def __contains__(self, word):
    method __setitem__ (line 66) | def __setitem__(self, key, value):
    method __len__ (line 71) | def __len__(self):
    method __repr__ (line 77) | def __repr__(self):
    method id2word (line 83) | def id2word(self, wid):
    method add (line 90) | def add(self, word):
    method words2indices (line 102) | def words2indices(self, sents):
    method indices2words (line 113) | def indices2words(self, word_ids):
    method to_input_tensor (line 120) | def to_input_tensor(self, sents: List[List[str]], device: torch.device...
    method from_corpus (line 135) | def from_corpus(corpus, size, freq_cutoff=2):
    method from_subword_list (line 153) | def from_subword_list(subword_list):
  class Vocab (line 160) | class Vocab(object):
    method __init__ (line 163) | def __init__(self, src_vocab: VocabEntry, tgt_vocab: VocabEntry):
    method build (line 172) | def build(src_sents, tgt_sents) -> 'Vocab':
    method save (line 186) | def save(self, file_path):
    method load (line 194) | def load(file_path):
    method __repr__ (line 205) | def __repr__(self):
  function get_vocab_list (line 212) | def get_vocab_list(file_path, source, vocab_size):

FILE: a5/mingpt-demo/mingpt/model.py
  class GPTConfig (line 19) | class GPTConfig:
    method __init__ (line 25) | def __init__(self, vocab_size, block_size, **kwargs):
  class GPT1Config (line 31) | class GPT1Config(GPTConfig):
  class CausalSelfAttention (line 37) | class CausalSelfAttention(nn.Module):
    method __init__ (line 44) | def __init__(self, config):
    method forward (line 61) | def forward(self, x, layer_past=None):
  class Block (line 81) | class Block(nn.Module):
    method __init__ (line 84) | def __init__(self, config):
    method forward (line 96) | def forward(self, x):
  class GPT (line 101) | class GPT(nn.Module):
    method __init__ (line 104) | def __init__(self, config):
    method get_block_size (line 122) | def get_block_size(self):
    method _init_weights (line 125) | def _init_weights(self, module):
    method configure_optimizers (line 134) | def configure_optimizers(self, train_config):
    method forward (line 180) | def forward(self, idx, targets=None):

FILE: a5/mingpt-demo/mingpt/trainer.py
  class TrainerConfig (line 19) | class TrainerConfig:
    method __init__ (line 35) | def __init__(self, **kwargs):
  class Trainer (line 39) | class Trainer:
    method __init__ (line 41) | def __init__(self, model, train_dataset, test_dataset, config):
    method save_checkpoint (line 53) | def save_checkpoint(self):
    method train (line 59) | def train(self):

FILE: a5/mingpt-demo/mingpt/utils.py
  function set_seed (line 7) | def set_seed(seed):
  function top_k_logits (line 13) | def top_k_logits(logits, k):
  function sample (line 20) | def sample(model, x, steps, temperature=1.0, sample=False, top_k=None):

FILE: a5/src/attention.py
  class CausalSelfAttention (line 10) | class CausalSelfAttention(nn.Module):
    method __init__ (line 17) | def __init__(self, config):
    method forward (line 34) | def forward(self, x, layer_past=None):
  class SynthesizerAttention (line 71) | class SynthesizerAttention(nn.Module):
    method __init__ (line 72) | def __init__(self, config):
    method forward (line 97) | def forward(self, x, layer_past=None):

FILE: a5/src/dataset.py
  class NameDataset (line 24) | class NameDataset(Dataset):
    method __init__ (line 25) | def __init__(self, pretraining_dataset, data):
    method __len__ (line 33) | def __len__(self):
    method __getitem__ (line 37) | def __getitem__(self, idx):
  class CharCorruptionDataset (line 144) | class CharCorruptionDataset(Dataset):
    method __init__ (line 145) | def __init__(self, data, block_size):
    method __len__ (line 165) | def __len__(self):
    method __getitem__ (line 169) | def __getitem__(self, idx):

FILE: a5/src/model.py
  class GPTConfig (line 20) | class GPTConfig:
    method __init__ (line 27) | def __init__(self, vocab_size, block_size, synthesizer=False, **kwargs):
  class GPT1Config (line 34) | class GPT1Config(GPTConfig):
  class Block (line 40) | class Block(nn.Module):
    method __init__ (line 43) | def __init__(self, config):
    method forward (line 60) | def forward(self, x):
  class GPT (line 65) | class GPT(nn.Module):
    method __init__ (line 68) | def __init__(self, config):
    method _init_weights (line 86) | def _init_weights(self, module):
    method get_block_size (line 95) | def get_block_size(self):
    method forward (line 98) | def forward(self, idx, targets=None):
  class CustomLayerNorm (line 117) | class CustomLayerNorm(nn.Module):

FILE: a5/src/trainer.py
  class TrainerConfig (line 21) | class TrainerConfig:
    method __init__ (line 37) | def __init__(self, **kwargs):
  class Trainer (line 41) | class Trainer:
    method __init__ (line 43) | def __init__(self, model, train_dataset, test_dataset, config):
    method save_checkpoint (line 55) | def save_checkpoint(self):
    method train (line 61) | def train(self):

FILE: a5/src/utils.py
  function set_seed (line 12) | def set_seed(seed):
  function top_k_logits (line 18) | def top_k_logits(logits, k):
  function sample (line 25) | def sample(model, x, steps, temperature=1.0, sample=False, top_k=None):
  function evaluate_places (line 55) | def evaluate_places(filepath, predicted_places):

Condensed preview — 183 files, each showing path, character count, and a content snippet. Download the .json file for the full structured content (24,960K chars).

[
  {
    "path": "README.md",
    "chars": 12864,
    "preview": "# Stanford Course CS224n - Natural Language Processing with Deep Learning (Winter 2021)\n\nThese are my solutions to the a"
  },
  {
    "path": "a1/.ipynb_checkpoints/exploring_word_vectors-checkpoint.ipynb",
    "chars": 115085,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# CS224N Assignment 1: Exploring Wo"
  },
  {
    "path": "a1/Gensim word vector visualization.ipynb",
    "chars": 7118,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Gensim word vector visualization "
  },
  {
    "path": "a1/README.txt",
    "chars": 1317,
    "preview": "Welcome to CS224N!\n\nWe'll be using Python throughout the course. If you've got a good Python setup already, great! But m"
  },
  {
    "path": "a1/env.yml",
    "chars": 168,
    "preview": "name: cs224n\nchannels:\n  - defaults\n  - anaconda\ndependencies:\n  - jupyter\n  - matplotlib\n  - numpy\n  - python=3.7\n  - i"
  },
  {
    "path": "a1/exploring_word_vectors.ipynb",
    "chars": 46171,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# CS224N Assignment 1: Exploring Wo"
  },
  {
    "path": "a1/exploring_word_vectors_solved.ipynb",
    "chars": 115099,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# CS224N Assignment 1: Exploring Wo"
  },
  {
    "path": "a2/collect_submission.sh",
    "chars": 79,
    "preview": "rm -f assignment2.zip\nzip -r assignment2.zip *.py *.png saved_params_40000.npy\n"
  },
  {
    "path": "a2/env.yml",
    "chars": 128,
    "preview": "name: a2\nchannels:\n  - defaults\n  - anaconda\ndependencies:\n  - jupyter\n  - matplotlib\n  - numpy\n  - python=3.7\n  - sciki"
  },
  {
    "path": "a2/get_datasets.sh",
    "chars": 403,
    "preview": "#!/bin/bash\n\nDATASETS_DIR=\"utils/datasets\"\nmkdir -p $DATASETS_DIR\n\ncd $DATASETS_DIR\n\n# Get Stanford Sentiment Treebank\ni"
  },
  {
    "path": "a2/run.py",
    "chars": 2282,
    "preview": "#!/usr/bin/env python\n\nimport random\nimport numpy as np\nfrom utils.treebank import StanfordSentiment\nimport matplotlib\nm"
  },
  {
    "path": "a2/sgd.py",
    "chars": 3643,
    "preview": "#!/usr/bin/env python\n\n# Save parameters every a few SGD iterations as fail-safe\nSAVE_PARAMS_EVERY = 5000\n\nimport pickle"
  },
  {
    "path": "a2/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "a2/utils/datasets/stanfordSentimentTreebank/README.txt",
    "chars": 2357,
    "preview": "Stanford Sentiment Treebank V1.0\n\nThis is the dataset of the paper:\n\nRecursive Deep Models for Semantic Compositionality"
  },
  {
    "path": "a2/utils/datasets/stanfordSentimentTreebank/SOStr.txt",
    "chars": 1225912,
    "preview": "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|A"
  },
  {
    "path": "a2/utils/datasets/stanfordSentimentTreebank/STree.txt",
    "chars": 1308918,
    "preview": "70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|"
  },
  {
    "path": "a2/utils/datasets/stanfordSentimentTreebank/datasetSentences.txt",
    "chars": 1290029,
    "preview": "sentence_index\tsentence\n1\tThe Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a "
  },
  {
    "path": "a2/utils/datasets/stanfordSentimentTreebank/datasetSplit.txt",
    "chars": 83764,
    "preview": "sentence_index,splitset_label\n1,1\n2,1\n3,2\n4,2\n5,2\n6,2\n7,2\n8,2\n9,2\n10,2\n11,2\n12,2\n13,2\n14,2\n15,2\n16,2\n17,2\n18,2\n19,2\n20,2"
  },
  {
    "path": "a2/utils/datasets/stanfordSentimentTreebank/original_rt_snippets.txt",
    "chars": 1195427,
    "preview": "The Rock is destined to be the 21st Century's new ``Conan'' and that he's going to make a splash even greater than Arnol"
  },
  {
    "path": "a2/utils/datasets/stanfordSentimentTreebank/sentiment_labels.txt",
    "chars": 3263577,
    "preview": "phrase ids|sentiment values\n0|0.5\n1|0.5\n2|0.44444\n3|0.5\n4|0.42708\n5|0.375\n6|0.41667\n7|0.54167\n8|0.33333\n9|0.45833\n10|0.4"
  },
  {
    "path": "a2/utils/gradcheck.py",
    "chars": 11737,
    "preview": "#!/usr/bin/env python\n\nimport numpy as np\nimport random\n\n\n# First implement a gradient checker by filling in the followi"
  },
  {
    "path": "a2/utils/treebank.py",
    "chars": 7544,
    "preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nimport pickle\nimport numpy as np\nimport os\nimport random\n\nclass StanfordS"
  },
  {
    "path": "a2/utils/utils.py",
    "chars": 1051,
    "preview": "#!/usr/bin/env python\n\nimport numpy as np\n\ndef normalizeRows(x):\n    \"\"\" Row normalization function\n\n    Implement a fun"
  },
  {
    "path": "a2/word2vec.py",
    "chars": 13786,
    "preview": "#!/usr/bin/env python\n\nimport argparse\nimport numpy as np\nimport random\n\nfrom utils.gradcheck import gradcheck_naive, gr"
  },
  {
    "path": "a3/README.txt",
    "chars": 1087,
    "preview": "Welcome to Assignment 3!\n\nWe'll be using PyTorch for this assignment. If you're not familiar with PyTorch, or if you wou"
  },
  {
    "path": "a3/collect_submission.sh",
    "chars": 65,
    "preview": "rm -f assignment3.zip\nzip -r assignment3.zip *.py ./data ./utils\n"
  },
  {
    "path": "a3/data/dev.conll",
    "chars": 1306509,
    "preview": "1\tInfluential\t_\tADJ\tJJ\t_\t2\tamod\t_\t_\n2\tmembers\t_\tNOUN\tNNS\t_\t10\tnsubj\t_\t_\n3\tof\t_\tADP\tIN\t_\t6\tcase\t_\t_\n4\tthe\t_\tDET\tDT\t_\t6\tde"
  },
  {
    "path": "a3/data/dev.gold.conll",
    "chars": 1306537,
    "preview": "1\tInfluential\t_\tADJ\tJJ\t_\t2\tamod\t_\t_\n2\tmembers\t_\tNOUN\tNNS\t_\t10\tnsubj\t_\t_\n3\tof\t_\tADP\tIN\t_\t6\tcase\t_\t_\n4\tthe\t_\tDET\tDT\t_\t6\tde"
  },
  {
    "path": "a3/data/test.conll",
    "chars": 1846748,
    "preview": "1\tNo\t_\tADV\tDT\t_\t7\tdiscourse\t_\t_\n2\t,\t_\tPUNCT\t,\t_\t7\tpunct\t_\t_\n3\tit\t_\tPRON\tPRP\t_\t7\tnsubj\t_\t_\n4\twas\t_\tVERB\tVBD\t_\t7\tcop\t_\t_\n5"
  },
  {
    "path": "a3/data/test.gold.conll",
    "chars": 1846564,
    "preview": "1\tNo\t_\tADV\tRB\t_\t7\tdiscourse\t_\t_\n2\t,\t_\tPUNCT\t,\t_\t7\tpunct\t_\t_\n3\tit\t_\tPRON\tPRP\t_\t7\tnsubj\t_\t_\n4\twas\t_\tVERB\tVBD\t_\t7\tcop\t_\t_\n5"
  },
  {
    "path": "a3/local_env.yml",
    "chars": 138,
    "preview": "name: cs224n_a3\nchannels:\n  - pytorch\n  - defaults\ndependencies:\n  - python=3.7\n  - numpy\n  - tqdm\n  - docopt\n  - pytorc"
  },
  {
    "path": "a3/parser_model.py",
    "chars": 10012,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\nCS224N 2020-2021: Homework 3\nparser_model.py: Feed-Forward Neural Net"
  },
  {
    "path": "a3/parser_transitions.py",
    "chars": 12467,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\nCS224N 2020-2021: Homework 3\nparser_transitions.py: Algorithms for co"
  },
  {
    "path": "a3/run.py",
    "chars": 6427,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\nCS224N 2020-2021: Homework 3\nrun.py: Run the dependency parser.\nSahil"
  },
  {
    "path": "a3/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "a3/utils/general_utils.py",
    "chars": 2448,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\nCS224N 2020-2021: Homework 3\ngeneral_utils.py: General purpose utilit"
  },
  {
    "path": "a3/utils/parser_utils.py",
    "chars": 16287,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\nCS224N 2020-2021: Homework 3\nparser_utils.py: Utilities for training "
  },
  {
    "path": "a4/README.md",
    "chars": 95,
    "preview": "# NMT Assignment\nNote: Heavily inspired by the https://github.com/pcyin/pytorch_nmt repository\n"
  },
  {
    "path": "a4/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "a4/chr_en_data/dev.chr",
    "chars": 61677,
    "preview": "ᏉᎳ, ᎠᎴ ᏌᏱᎳ, ᎠᎴ ᏗᎹᏗ, ᏫᏨᏲᏪᎳᏏ ᏗᏣᏁᎶᏗ ᎢᏣᏓᏡᎬ ᏕᏏᎶᏂᎦ ᎢᏤᎯ ᎤᏁᎳᏅᎯ ᎢᎩᏙᏓ ᎠᎴ ᎤᎬᏫᏳᎯ ᏥᏌ ᎦᎶᏁᏛ ᏕᏣᏁᎶᏛᎢ.\nᎤᏂᏣᏘᏃ ᎤᎾᏛᎦᏅ ᎤᏂᏍᏆᏂᎪᏒᎩ ᏄᏍᏛ ᏓᏕᏲᎲᏍᎬᎢ.\nᎾ"
  },
  {
    "path": "a4/chr_en_data/dev.en",
    "chars": 97843,
    "preview": "Paul, and Silvanus, and Timothy, unto the church of the Thessalonians in God our Father and the Lord Jesus Christ;\nAnd w"
  },
  {
    "path": "a4/chr_en_data/test.chr",
    "chars": 62209,
    "preview": "ᎡᏆᎭᎻ ᎯᎠ ᏄᏪᏎᎴᎢ; ᎼᏏ ᎠᎴ ᎠᎾᏙᎴᎰᏍᎩ ᏚᏁᎭ; ᎾᏍᎩ ᏫᏓᎾᏛᏓᏍᏓᏏ.\n“ᎭᏩ ᏍᎩᏉᏗ ᏄᏍᏕᏍᏗ, ᎡᏚᏥ.” ᎤᏛᏁ ᏌᎳᏓ.\nᎾᏍᎩ ᎢᏳᏍᏗ ᏞᏍᏗ ᎢᏣᏓᏄᎸᏗᏉ ᏱᎨᏎᏍᏗ, ᏗᏥᏍᏓᏩᏕᎩᏍᎩᏂ ᎨᏎ"
  },
  {
    "path": "a4/chr_en_data/test.en",
    "chars": 100315,
    "preview": "But Abraham saith, They have Moses and the prophets; let them hear them.\n“Very well, Uncle,»” replied Charlotte.\nthat ye"
  },
  {
    "path": "a4/chr_en_data/train.chr",
    "chars": 985933,
    "preview": "ᎠᎴ ᎾᏍᎩ ᏅᏗᎦᎵᏍᏙᏗᎭ ᏂᎦᏛ ᎠᏃᎯᏳᎲᏍᎩ ᎠᎾᏚᏓᎴᎭ ᏂᎦᎥ ᏧᏓᎴᏅᏛ, ᎾᏍᎩ ᎨᏣᏚᏓᎳᎡᏗ ᏂᎨᏒᎾ ᏥᎨᏒ ᎼᏏ ᎤᏤᎵ ᏗᎧᎿᏩᏛᏍᏗ ᏕᏥᎧᎿᏩᏗᏒ ᎢᏳᏍᏗ.\nᎤᏯᎪᏅ ᎠᏰᎸᎢ, ᎠᏍᎪᎵ ᎤᏍᏆᏄᏤ ᎫᏕ"
  },
  {
    "path": "a4/chr_en_data/train.en",
    "chars": 1574489,
    "preview": "and by him every one that believeth is justified from all things, from which ye could not be justified by the law of Mos"
  },
  {
    "path": "a4/collect_submission.bat",
    "chars": 113,
    "preview": "@echo off\r\ndel /f assignment4.zip\r\ntar.exe -r -f assignment4.zip *.py chr_en_data sanity_check_en_es_data outputs"
  },
  {
    "path": "a4/collect_submission.sh",
    "chars": 99,
    "preview": "rm -f assignment4.zip\nzip -r assignment4.zip *.py ./chr_en_data ./sanity_check_en_es_data ./outputs"
  },
  {
    "path": "a4/gpu_requirements.txt",
    "chars": 55,
    "preview": "nltk\ndocopt\ntqdm==4.29.1\nsentencepiece\nsacrebleu\ntorch\n"
  },
  {
    "path": "a4/local_env.yml",
    "chars": 210,
    "preview": "name: local_nmt\nchannels:\n  - pytorch\n  - defaults\ndependencies:\n  - python=3.7\n  - numpy\n  - scipy\n  - tqdm\n  - docopt\n"
  },
  {
    "path": "a4/model_embeddings.py",
    "chars": 2087,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nCS224N 2020-21: Homework 4\nmodel_embeddings.py: Embeddings for the N"
  },
  {
    "path": "a4/nmt_model.py",
    "chars": 28325,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nCS224N 2020-21: Homework 4\nnmt_model.py: NMT Model\nPencheng Yin <pcy"
  },
  {
    "path": "a4/outputs/.gitignore",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "a4/outputs/test_outputs.txt",
    "chars": 82167,
    "preview": " The Abraham said unto him, Moses and prophets; that he may be with them.\n “Well,” said Charlotte.\n Therefore shall not "
  },
  {
    "path": "a4/run.bat",
    "chars": 1210,
    "preview": "@echo off\nrem    Run this file on the command line of an environment that contains \"python\" in path\nrem    For example, "
  },
  {
    "path": "a4/run.py",
    "chars": 15334,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nCS224N 2020-21: Homework 4\nrun.py: Run Script for Simple NMT Model\nP"
  },
  {
    "path": "a4/run.sh",
    "chars": 1014,
    "preview": "#!/bin/bash\n\nif [ \"$1\" = \"train\" ]; then\n\tCUDA_VISIBLE_DEVICES=0 python run.py train --train-src=./chr_en_data/train.chr"
  },
  {
    "path": "a4/sanity_check.py",
    "chars": 12047,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nCS224N 2019-20: Homework 4\nsanity_check.py: sanity checks for assign"
  },
  {
    "path": "a4/sanity_check_en_es_data/dev_sanity_check.en",
    "chars": 2267,
    "preview": "And the question that I want to ask everybody here today  is are you guys all cool with that idea?\nBAA: Yes. What this i"
  },
  {
    "path": "a4/sanity_check_en_es_data/dev_sanity_check.es",
    "chars": 2212,
    "preview": "Y la pregunta que quiero hacerles a todos hoy es estn todos ustedes a gusto con esa idea?\nBAA: S. Lo que hace realmente "
  },
  {
    "path": "a4/sanity_check_en_es_data/test_sanity_check.en",
    "chars": 426,
    "preview": "Ironically, by borrowing out their voices,  I'm able to maintain a temporary form of currency,  kind of like taking out "
  },
  {
    "path": "a4/sanity_check_en_es_data/test_sanity_check.es",
    "chars": 413,
    "preview": "Irnicamente, al pedir prestadas sus voces, soy capaz de mantener una forma temporal de valor algo as como tomar un prsta"
  },
  {
    "path": "a4/sanity_check_en_es_data/train_sanity_check.en",
    "chars": 3950,
    "preview": "But what can you do? You're in the middle of the ocean.\nSo in this situation too, to decode the information contained in"
  },
  {
    "path": "a4/sanity_check_en_es_data/train_sanity_check.es",
    "chars": 3861,
    "preview": "Pero, qu puedes hacer? Ests en el medio del ocano.\nAs que en esta situacin tambin, para decodificar la informacin conten"
  },
  {
    "path": "a4/sanity_check_en_es_data/vocab_sanity_check.json",
    "chars": 2568,
    "preview": "{\n  \"src_word2id\": {\n    \"<pad>\": 0,\n    \"<s>\": 1,\n    \"</s>\": 2,\n    \"<unk>\": 3,\n    \"de\": 4,\n    \"que\": 5,\n    \"el\": 6"
  },
  {
    "path": "a4/src.vocab",
    "chars": 314808,
    "preview": "<unk>\t0\n<s>\t0\n</s>\t0\n,\t-2.54393\n.\t-3.04052\n▁ᎠᎴ\t-3.32008\n▁ᎾᏍᎩ\t-3.74462\n;\t-3.84657\nᏃ\t-4.05736\n▁ᎯᎠ\t-4.40361\n▁ᎨᏒ\t-4.69076\n▁\t"
  },
  {
    "path": "a4/test_outputs.txt",
    "chars": 82167,
    "preview": " The Abraham said unto him, Moses and prophets; that he may be with them.\n “Well,” said Charlotte.\n Therefore shall not "
  },
  {
    "path": "a4/tgt.vocab",
    "chars": 130799,
    "preview": "<unk>\t0\n<s>\t0\n</s>\t0\n,\t-2.7062\n▁the\t-2.96591\n▁and\t-3.31214\n.\t-3.37006\n▁of\t-3.64592\n▁to\t-4.15438\n▁that\t-4.31026\n▁in\t-4.35"
  },
  {
    "path": "a4/utils.py",
    "chars": 3587,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nCS224N 2020-21: Homework 4\nnmt.py: NMT Model\nPencheng Yin <pcyin@cs."
  },
  {
    "path": "a4/vocab.json",
    "chars": 1140845,
    "preview": "{\n  \"src_word2id\": {\n    \"<pad>\": 0,\n    \"<s>\": 1,\n    \"</s>\": 2,\n    \"<unk>\": 3,\n    \",\": 4,\n    \".\": 5,\n    \"\\u2581\\u1"
  },
  {
    "path": "a4/vocab.py",
    "chars": 8931,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nCS224N 2020-21: Homework 4\nvocab.py: Vocabulary Generation\nPencheng "
  },
  {
    "path": "a5/birth_dev.tsv",
    "chars": 20017,
    "preview": "Where was Bryan Dubreuiel born?\tAtlanta\nWhere was Ralf Wadephul born?\tBerlin\nWhere was Joseph Baggaley born?\tEngland\nWhe"
  },
  {
    "path": "a5/birth_places_train.tsv",
    "chars": 79765,
    "preview": "Where was Khatchig Mouradian born?\tLebanon\nWhere was Jacob Henry Studer born?\tColumbus\nWhere was John Stephen born?\tGlas"
  },
  {
    "path": "a5/birth_test_inputs.tsv",
    "chars": 13854,
    "preview": "Where was Adam Bright born?\nWhere was Alan Hess born?\nWhere was Jacob Guptil Fletcher born?\nWhere was Hisham Mohd Ashour"
  },
  {
    "path": "a5/collect_submission.sh",
    "chars": 404,
    "preview": "rm -f assignment5_submission.zip\nzip -r assignment5_submission.zip src/ birth_dev.tsv birth_places_train.tsv wiki.txt va"
  },
  {
    "path": "a5/d_cmd",
    "chars": 615,
    "preview": "# Train on the names dataset\npython src/run.py finetune vanilla wiki.txt --writing_params_path vanilla.model.params --fi"
  },
  {
    "path": "a5/f_cmd",
    "chars": 750,
    "preview": "# Pretrain the model\npython src/run.py pretrain vanilla wiki.txt --writing_params_path vanilla.pretrain.params\n\n# Finetu"
  },
  {
    "path": "a5/g_cmd",
    "chars": 797,
    "preview": "# Pretrain the model\npython src/run.py pretrain synthesizer wiki.txt --writing_params_path synthesizer.pretrain.params \n"
  },
  {
    "path": "a5/mingpt-demo/.ipynb_checkpoints/play_char-checkpoint.ipynb",
    "chars": 11483,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Train a character-level GPT on s"
  },
  {
    "path": "a5/mingpt-demo/LICENSE",
    "chars": 1081,
    "preview": "The MIT License (MIT) Copyright (c) 2020 Andrej Karpathy\n\nPermission is hereby granted, free of charge, to any person ob"
  },
  {
    "path": "a5/mingpt-demo/README.md",
    "chars": 8728,
    "preview": "\n# minGPT\n\n![mingpt](mingpt.jpg)\n\nA PyTorch re-implementation of [GPT](https://github.com/openai/gpt-3) training. minGPT"
  },
  {
    "path": "a5/mingpt-demo/mingpt/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "a5/mingpt-demo/mingpt/model.py",
    "chars": 8455,
    "preview": "\"\"\"\nGPT model:\n- the initial stem consists of a combination of token encoding and a positional encoding\n- the meat of it"
  },
  {
    "path": "a5/mingpt-demo/mingpt/trainer.py",
    "chars": 5268,
    "preview": "\"\"\"\nSimple training loop; Boilerplate that could apply to any arbitrary neural network,\nso nothing in this file really h"
  },
  {
    "path": "a5/mingpt-demo/mingpt/utils.py",
    "chars": 1718,
    "preview": "import random\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.nn import functional as F\n\ndef set_seed(s"
  },
  {
    "path": "a5/mingpt-demo/play_char.ipynb",
    "chars": 9161,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Train a character-level GPT on s"
  },
  {
    "path": "a5/src/attention.py",
    "chars": 6235,
    "preview": "import math\nimport logging\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import functional as F\n\nlogger = logging.ge"
  },
  {
    "path": "a5/src/dataset.py",
    "chars": 13143,
    "preview": "import random\nimport torch\nfrom torch.utils.data import Dataset\nimport argparse\n\n\"\"\"\nThe input-output pairs (x, y) of th"
  },
  {
    "path": "a5/src/london_baseline.py",
    "chars": 544,
    "preview": "# Calculate the accuracy of a baseline that simply predicts \"London\" for every\n#   example in the dev set.\n# Hint: Make "
  },
  {
    "path": "a5/src/model.py",
    "chars": 4022,
    "preview": "\n\"\"\"\nGPT model:\n- the initial stem consists of a combination of token encoding and a positional encoding\n- the meat of i"
  },
  {
    "path": "a5/src/run.py",
    "chars": 7753,
    "preview": "import numpy as np\nimport torch\nimport torch.nn as nn\nfrom tqdm import tqdm\nfrom torch.nn import functional as F\nimport "
  },
  {
    "path": "a5/src/trainer.py",
    "chars": 5202,
    "preview": "\"\"\"\nSimple training loop; Boilerplate that could apply to any arbitrary neural network,\nso nothing in this file really h"
  },
  {
    "path": "a5/src/utils.py",
    "chars": 2585,
    "preview": "\"\"\" Utilities; we suggest changing none of these functions\n\nbut feel free to add your own.\n\"\"\"\n\nimport random\nimport num"
  },
  {
    "path": "a5/synthesizer.pretrain.dev.predictions",
    "chars": 3960,
    "preview": "Montreal\nLondon\nSydney\nIran\nLondon\nLondon\nParis\nBristol\nSpain\nLondon\nChicago\nSofia\nLondon\nToronto\nLondon\nMelbourne\nMontr"
  },
  {
    "path": "a5/synthesizer.pretrain.test.predictions",
    "chars": 3443,
    "preview": "London\nChicago\nManchester\nLondon\nColombo\nParis\nBerlin\nColumbus\nParis\nCoventry\nColumbus\nBerlin\nLondon\nLondon\nChatham\nBerl"
  },
  {
    "path": "a5/vanilla.nopretrain.dev.predictions",
    "chars": 4009,
    "preview": "Paris\nParis\nMoscow\nAustralia\nBudapest\nLiverpool\nCambridge\nKerala\nNaples\nSheffield\nJamaica\nToronto\nWarsaw\nMontreal\nEnglan"
  },
  {
    "path": "a5/vanilla.nopretrain.test.predictions",
    "chars": 3428,
    "preview": "Sydney\nBudapest\nManchester\nKeral\nLondon\nRochester\nConcord\nMontreal\nParis\nSydney\nPoland\nLondon\nWinnipeg\nIstanbul\nEngland\n"
  },
  {
    "path": "a5/vanilla.pretrain.dev.predictions",
    "chars": 4101,
    "preview": "Montreal\nPrague\nVienna\nBangalore\nDetroit\nAmsterdam\nSydney\nKentucky\nBarcelona\nMontreal\nParis\nTurkey\nEdinburgh\nStockholm\nP"
  },
  {
    "path": "a5/vanilla.pretrain.test.predictions",
    "chars": 3575,
    "preview": "India\nCalifornia\nMaine\nBirmingham\nEdmonton\nPasadena\nBelgium\nManchester\nMarseille\nLondon\nMelbourne\nLeipzig\nCroydon\nPrague"
  },
  {
    "path": "a5/wiki.txt",
    "chars": 418352,
    "preview": "Khatchig Mouradian. Khatchig Mouradian is a journalist, writer and translator born in Lebanon .\nJacob Henry Studer. Jaco"
  }
]

// ... and 84 more files (download for full content)

About this extraction

This page contains the full source code of the amanchadha/stanford-cs224n-assignments-2021 GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 183 files (210.6 MB), approximately 5.1M tokens, and a symbol index with 197 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
