Repository: bastings/annotated_encoder_decoder Branch: master Commit: 622c65f2d880 Files: 6 Total size: 209.3 KB Directory structure: gitextract_t4r2cdzr/ ├── .gitignore ├── LICENSE ├── README.md ├── _config.yml ├── annotated_encoder_decoder.ipynb └── index.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover .hypothesis/ .pytest_cache/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder target/ # Jupyter Notebook .ipynb_checkpoints # pyenv .python-version # celery beat schedule file celerybeat-schedule # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2018 Joost Bastings Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software 
without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # The Annotated Encoder Decoder with Attention Read the [blog post](https://bastings.github.io/annotated_encoder_decoder/) or simply run the jupyter notebook from this repository. ================================================ FILE: _config.yml ================================================ title: The Annotated Encoder Decoder description: A PyTorch tutorial implementing Bahdanau et al. 
(2015) google_analytics: UA-126252625-1 show_downloads: true theme: jekyll-theme-cayman kramdown: math_engine: mathjax syntax_highlighter: rouge gems: - jekyll-mentions ================================================ FILE: annotated_encoder_decoder.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Annotated Encoder-Decoder with Attention\n", "\n", "Recently, Alexander Rush wrote a blog post called [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html), describing the Transformer model from the paper [Attention is All You Need](https://arxiv.org/abs/1706.03762). This post can be seen as a **prequel** to that: *we will implement an Encoder-Decoder with Attention* using (Gated) Recurrent Neural Networks, very closely following the original attention-based neural machine translation paper [\"Neural Machine Translation by Jointly Learning to Align and Translate\"](https://arxiv.org/abs/1409.0473) of Bahdanau et al. (2015). \n", "\n", "The idea is that going through both blog posts will make you familiar with two very influential sequence-to-sequence architectures. If you have any comments or suggestions, please let me know: [@BastingsJasmijn](https://twitter.com/BastingsJasmijn)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Architecture\n", "\n", "We will model the probability $p(Y\\mid X)$ of a target sequence $Y=(y_1, \\dots, y_{N})$ given a source sequence $X=(x_1, \\dots, x_M)$ directly with a neural network: an Encoder-Decoder.\n", "\n", "\n", "\n", "#### Encoder \n", "\n", "The encoder reads in the source sentence (*at the bottom of the figure*) and produces a sequence of hidden states $\\mathbf{h}_1, \\dots, \\mathbf{h}_M$, one for each source word. 
These states should capture the meaning of a word in the context of the given sentence.\n", "\n", "We will use a bi-directional recurrent neural network (Bi-RNN) as the encoder; a Bi-GRU in particular.\n", "\n", "First of all we **embed** the source words. \n", "We simply look up the **word embedding** for each word in a (randomly initialized) lookup table.\n", "We will denote the word embedding for word $i$ in a given sentence with $\\mathbf{x}_i$.\n", "By embedding words, our model may exploit the fact that certain words (e.g. *cat* and *dog*) are semantically similar, and can be processed in a similar way.\n", "\n", "Now, how do we get hidden states $\\mathbf{h}_1, \\dots, \\mathbf{h}_M$? A forward GRU reads the source sentence left-to-right, while a backward GRU reads it right-to-left.\n", "Each of them follows a simple recursive formula: \n", "$$\\mathbf{h}_j = \\text{GRU}( \\mathbf{x}_j , \\mathbf{h}_{j - 1} )$$\n", "i.e. we obtain the next state from the previous state and the current input word embedding.\n", "\n", "The hidden state of the forward GRU at time step $j$ will know what words **precede** the word at that time step, but it doesn't know what words will follow. In contrast, the backward GRU will only know what words **follow** the word at time step $j$. By **concatenating** those two hidden states (*shown in blue in the figure*), we get $\\mathbf{h}_j$, which captures word $j$ in its full sentence context.\n", "\n", "\n", "#### Decoder \n", "\n", "The decoder (*at the top of the figure*) is a GRU with hidden state $\\mathbf{s_i}$. 
It follows a similar formula to the encoder, but takes one extra input $\\mathbf{c}_{i}$ (*shown in yellow*).\n", "\n", "$$\\mathbf{s}_{i} = f( \\mathbf{s}_{i - 1}, \\mathbf{y}_{i - 1}, \\mathbf{c}_i )$$\n", "\n", "Here, $\\mathbf{y}_{i - 1}$ is the previously generated target word (*not shown*).\n", "\n", "At each time step, an **attention mechanism** dynamically selects the part of the source sentence that is most relevant for predicting the current target word. It does so by comparing the last decoder state with each source hidden state. The result is a context vector $\\mathbf{c}_i$ (*shown in yellow*).\n", "The attention mechanism is explained in more detail later.\n", "\n", "After computing the decoder state $\\mathbf{s}_i$, a non-linear function $g$ (which applies a [softmax](https://en.wikipedia.org/wiki/Softmax_function)) gives us the probability of the target word $y_i$ for this time step:\n", "\n", "$$ p(y_i \\mid y_{<i}, X) = g( \\mathbf{s}_i, \\mathbf{c}_i, \\mathbf{y}_{i - 1} )$$\n", "\n", "This tutorial requires **PyTorch >= 0.4.1** and was tested with **Python 3.6**. \n", "\n", "Make sure you have those versions, and install the packages below if you don't have them yet."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#!pip install torch numpy matplotlib sacrebleu" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CUDA: True\n", "cuda:0\n" ] } ], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import math, copy, time\n", "import matplotlib.pyplot as plt\n", "from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence\n", "from IPython.core.debugger import set_trace\n", "\n", "# we will use CUDA if it is available\n", "USE_CUDA = torch.cuda.is_available()\n", "DEVICE = torch.device('cuda:0' if USE_CUDA else 'cpu')\n", "print(\"CUDA:\", USE_CUDA)\n", "print(DEVICE)\n", "\n", "seed = 42\n", "np.random.seed(seed)\n", "torch.manual_seed(seed)\n", "torch.cuda.manual_seed(seed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's start coding!\n", "\n", "## Model class\n", "\n", "Our base model class `EncoderDecoder` is very similar to the one in *The Annotated Transformer*.\n", "\n", "One difference is that our encoder also returns its final states (`encoder_final` below), which are used to initialize the decoder RNN. We also provide the sequence lengths as the RNNs require those." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "class EncoderDecoder(nn.Module):\n", " \"\"\"\n", " A standard Encoder-Decoder architecture. 
Base for this and many \n", " other models.\n", " \"\"\"\n", " def __init__(self, encoder, decoder, src_embed, trg_embed, generator):\n", " super(EncoderDecoder, self).__init__()\n", " self.encoder = encoder\n", " self.decoder = decoder\n", " self.src_embed = src_embed\n", " self.trg_embed = trg_embed\n", " self.generator = generator\n", " \n", " def forward(self, src, trg, src_mask, trg_mask, src_lengths, trg_lengths):\n", " \"\"\"Take in and process masked src and target sequences.\"\"\"\n", " encoder_hidden, encoder_final = self.encode(src, src_mask, src_lengths)\n", " return self.decode(encoder_hidden, encoder_final, src_mask, trg, trg_mask)\n", " \n", " def encode(self, src, src_mask, src_lengths):\n", " return self.encoder(self.src_embed(src), src_mask, src_lengths)\n", " \n", " def decode(self, encoder_hidden, encoder_final, src_mask, trg, trg_mask,\n", " decoder_hidden=None):\n", " return self.decoder(self.trg_embed(trg), encoder_hidden, encoder_final,\n", " src_mask, trg_mask, hidden=decoder_hidden)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To keep things easy we also keep the `Generator` class the same. \n", "It simply projects the pre-output layer ($x$ in the `forward` function below) to obtain the output layer, so that the final dimension is the target vocabulary size." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class Generator(nn.Module):\n", " \"\"\"Define standard linear + softmax generation step.\"\"\"\n", " def __init__(self, hidden_size, vocab_size):\n", " super(Generator, self).__init__()\n", " self.proj = nn.Linear(hidden_size, vocab_size, bias=False)\n", "\n", " def forward(self, x):\n", " return F.log_softmax(self.proj(x), dim=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoder\n", "\n", "Our encoder is a bi-directional GRU. 
\n", "\n", "Because we want to process multiple sentences at the same time for speed reasons (it is more effcient on GPU), we need to support **mini-batches**. Sentences in a mini-batch may have different lengths, which means that the RNN needs to unroll further for certain sentences while it might already have finished for others:\n", "\n", "```\n", "Example: mini-batch with 3 source sentences of different lengths (7, 5, and 3).\n", "End-of-sequence is marked with a \"3\" here, and padding positions with \"1\".\n", "\n", "+---------------+\n", "| 4 5 9 8 7 8 3 |\n", "+---------------+\n", "| 5 4 8 7 3 1 1 |\n", "+---------------+\n", "| 5 8 3 1 1 1 1 |\n", "+---------------+\n", "```\n", "You can see that, when computing hidden states for this mini-batch, for sentence #2 and #3 we will need to stop updating the hidden state after we have encountered \"3\". We don't want to incorporate the padding values (1s).\n", "\n", "Luckily, PyTorch has convenient helper functions called `pack_padded_sequence` and `pad_packed_sequence`.\n", "These functions take care of masking and padding, so that the resulting word representations are simply zeros after a sentence stops.\n", "\n", "The code below reads in a source sentence (a sequence of word embeddings) and produces the hidden states.\n", "It also returns a final vector, a summary of the complete sentence, by concatenating the first and the last hidden states (they have both seen the whole sentence, each in a different direction). We will use the final vector to initialize the decoder." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "class Encoder(nn.Module):\n", " \"\"\"Encodes a sequence of word embeddings\"\"\"\n", " def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.):\n", " super(Encoder, self).__init__()\n", " self.num_layers = num_layers\n", " self.rnn = nn.GRU(input_size, hidden_size, num_layers, \n", " batch_first=True, bidirectional=True, dropout=dropout)\n", " \n", " def forward(self, x, mask, lengths):\n", " \"\"\"\n", " Applies a bidirectional GRU to sequence of embeddings x.\n", " The input mini-batch x needs to be sorted by length.\n", " x should have dimensions [batch, time, dim].\n", " \"\"\"\n", " packed = pack_padded_sequence(x, lengths, batch_first=True)\n", " output, final = self.rnn(packed)\n", " output, _ = pad_packed_sequence(output, batch_first=True)\n", "\n", " # we need to manually concatenate the final states for both directions\n", " fwd_final = final[0:final.size(0):2]\n", " bwd_final = final[1:final.size(0):2]\n", " final = torch.cat([fwd_final, bwd_final], dim=2) # [num_layers, batch, 2*dim]\n", "\n", " return output, final" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decoder\n", "\n", "The decoder is a conditional GRU. Rather than starting with an empty state like the encoder, its initial hidden state results from a projection of the encoder final vector. \n", "\n", "#### Training\n", "In `forward` you can find a for-loop that computes the decoder hidden states one time step at a time. \n", "Note that, during training, we know exactly what the target words should be! (They are in `trg_embed`.) This means that we are not even checking here what the prediction is! We simply feed the correct previous target word embedding to the GRU at each time step. This is called teacher forcing.\n", "\n", "The `forward` function returns all decoder hidden states and pre-output vectors. 
Elsewhere these are used to compute the loss, after which the parameters are updated.\n", "\n", "#### Prediction\n", "At prediction time, the `forward` function is used for a single time step at a time. After predicting a word from the returned pre-output vector, we can call it again, supplying it the word embedding of the previously predicted word and the last state." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "class Decoder(nn.Module):\n", " \"\"\"A conditional RNN decoder with attention.\"\"\"\n", " \n", " def __init__(self, emb_size, hidden_size, attention, num_layers=1, dropout=0.5,\n", " bridge=True):\n", " super(Decoder, self).__init__()\n", " \n", " self.hidden_size = hidden_size\n", " self.num_layers = num_layers\n", " self.attention = attention\n", " self.dropout = dropout\n", " \n", " self.rnn = nn.GRU(emb_size + 2*hidden_size, hidden_size, num_layers,\n", " batch_first=True, dropout=dropout)\n", " \n", " # to initialize from the final encoder state\n", " self.bridge = nn.Linear(2*hidden_size, hidden_size, bias=True) if bridge else None\n", "\n", " self.dropout_layer = nn.Dropout(p=dropout)\n", " self.pre_output_layer = nn.Linear(hidden_size + 2*hidden_size + emb_size,\n", " hidden_size, bias=False)\n", " \n", " def forward_step(self, prev_embed, encoder_hidden, src_mask, proj_key, hidden):\n", " \"\"\"Perform a single decoder step (1 word)\"\"\"\n", "\n", " # compute context vector using attention mechanism\n", " query = hidden[-1].unsqueeze(1) # [#layers, B, D] -> [B, 1, D]\n", " context, attn_probs = self.attention(\n", " query=query, proj_key=proj_key,\n", " value=encoder_hidden, mask=src_mask)\n", "\n", " # update rnn hidden state\n", " rnn_input = torch.cat([prev_embed, context], dim=2)\n", " output, hidden = self.rnn(rnn_input, hidden)\n", " \n", " pre_output = torch.cat([prev_embed, output, context], dim=2)\n", " pre_output = self.dropout_layer(pre_output)\n", " pre_output = 
self.pre_output_layer(pre_output)\n", "\n", " return output, hidden, pre_output\n", " \n", " def forward(self, trg_embed, encoder_hidden, encoder_final, \n", " src_mask, trg_mask, hidden=None, max_len=None):\n", " \"\"\"Unroll the decoder one step at a time.\"\"\"\n", " \n", " # the maximum number of steps to unroll the RNN\n", " if max_len is None:\n", " max_len = trg_mask.size(-1)\n", "\n", " # initialize decoder hidden state\n", " if hidden is None:\n", " hidden = self.init_hidden(encoder_final)\n", " \n", " # pre-compute projected encoder hidden states\n", " # (the \"keys\" for the attention mechanism)\n", " # this is only done for efficiency\n", " proj_key = self.attention.key_layer(encoder_hidden)\n", " \n", " # here we store all intermediate hidden states and pre-output vectors\n", " decoder_states = []\n", " pre_output_vectors = []\n", " \n", " # unroll the decoder RNN for max_len steps\n", " for i in range(max_len):\n", " prev_embed = trg_embed[:, i].unsqueeze(1)\n", " output, hidden, pre_output = self.forward_step(\n", " prev_embed, encoder_hidden, src_mask, proj_key, hidden)\n", " decoder_states.append(output)\n", " pre_output_vectors.append(pre_output)\n", "\n", " decoder_states = torch.cat(decoder_states, dim=1)\n", " pre_output_vectors = torch.cat(pre_output_vectors, dim=1)\n", " return decoder_states, hidden, pre_output_vectors # [B, N, D]\n", "\n", " def init_hidden(self, encoder_final):\n", " \"\"\"Returns the initial decoder state,\n", " conditioned on the final encoder state.\"\"\"\n", "\n", " if encoder_final is None:\n", " return None # start with zeros\n", "\n", " return torch.tanh(self.bridge(encoder_final)) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attention \n", "\n", "At every time step, the decoder has access to *all* source word representations $\\mathbf{h}_1, \\dots, \\mathbf{h}_M$. 
\n", "An attention mechanism allows the model to focus on the currently most relevant part of the source sentence.\n", "The state of the decoder is represented by GRU hidden state $\\mathbf{s}_i$.\n", "So if we want to know which source word representation(s) $\\mathbf{h}_j$ are most relevant, we will need to define a function that takes those two things as input.\n", "\n", "Here we use the MLP-based, additive attention that was used in Bahdanau et al.:\n", "\n", "\n", "\n", "\n", "We apply an MLP with tanh-activation to both the current decoder state $\\bf s_i$ (the *query*) and each encoder state $\\bf h_j$ (the *key*), and then project this to a single value (i.e. a scalar) to get the *attention energy* $e_{ij}$. \n", "\n", "Once all energies are computed, they are normalized by a softmax so that they sum to one: \n", "\n", "$$ \\alpha_{ij} = \\text{softmax}(\\mathbf{e}_i)[j] $$\n", "\n", "$$\\sum_j \\alpha_{ij} = 1.0$$ \n", "\n", "The context vector for time step $i$ is then a weighted sum of the encoder hidden states (the *values*):\n", "$$\\mathbf{c}_i = \\sum_j \\alpha_{ij} \\mathbf{h}_j$$" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "class BahdanauAttention(nn.Module):\n", " \"\"\"Implements Bahdanau (MLP) attention\"\"\"\n", " \n", " def __init__(self, hidden_size, key_size=None, query_size=None):\n", " super(BahdanauAttention, self).__init__()\n", " \n", " # We assume a bi-directional encoder so key_size is 2*hidden_size\n", " key_size = 2 * hidden_size if key_size is None else key_size\n", " query_size = hidden_size if query_size is None else query_size\n", "\n", " self.key_layer = nn.Linear(key_size, hidden_size, bias=False)\n", " self.query_layer = nn.Linear(query_size, hidden_size, bias=False)\n", " self.energy_layer = nn.Linear(hidden_size, 1, bias=False)\n", " \n", " # to store attention scores\n", " self.alphas = None\n", " \n", " def forward(self, query=None, proj_key=None, value=None, 
mask=None):\n", " assert mask is not None, \"mask is required\"\n", "\n", " # We first project the query (the decoder state).\n", " # The projected keys (the encoder states) were already pre-computed.\n", " query = self.query_layer(query)\n", " \n", " # Calculate scores.\n", " scores = self.energy_layer(torch.tanh(query + proj_key))\n", " scores = scores.squeeze(2).unsqueeze(1)\n", " \n", " # Mask out invalid positions.\n", " # The mask marks valid positions, so we fill positions where mask == 0 with -inf.\n", " scores.data.masked_fill_(mask == 0, -float('inf'))\n", " \n", " # Turn scores to probabilities.\n", " alphas = F.softmax(scores, dim=-1)\n", " self.alphas = alphas \n", " \n", " # The context vector is the weighted sum of the values.\n", " context = torch.bmm(alphas, value)\n", " \n", " # context shape: [B, 1, 2D], alphas shape: [B, 1, M]\n", " return context, alphas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Embeddings and Softmax \n", "We use learned embeddings to convert the input tokens and output tokens to vectors of dimension `emb_size`.\n", "\n", "We will simply use PyTorch's [nn.Embedding](https://pytorch.org/docs/stable/nn.html?highlight=embedding#torch.nn.Embedding) class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Model\n", "\n", "Here we define a function from hyperparameters to a full model. 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def make_model(src_vocab, tgt_vocab, emb_size=256, hidden_size=512, num_layers=1, dropout=0.1):\n", " \"Helper: Construct a model from hyperparameters.\"\n", "\n", " attention = BahdanauAttention(hidden_size)\n", "\n", " model = EncoderDecoder(\n", " Encoder(emb_size, hidden_size, num_layers=num_layers, dropout=dropout),\n", " Decoder(emb_size, hidden_size, attention, num_layers=num_layers, dropout=dropout),\n", " nn.Embedding(src_vocab, emb_size),\n", " nn.Embedding(tgt_vocab, emb_size),\n", " Generator(hidden_size, tgt_vocab))\n", "\n", " return model.cuda() if USE_CUDA else model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Training\n", "\n", "This section describes the training regime for our models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We stop for a quick interlude to introduce some of the tools \n", "needed to train a standard encoder decoder model. First we define a batch object that holds the src and target sentences for training, as well as their lengths and masks. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batches and Masking" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "class Batch:\n", " \"\"\"Object for holding a batch of data with mask during training.\n", " Input is a batch from a torch text iterator.\n", " \"\"\"\n", " def __init__(self, src, trg, pad_index=0):\n", " \n", " src, src_lengths = src\n", " \n", " self.src = src\n", " self.src_lengths = src_lengths\n", " self.src_mask = (src != pad_index).unsqueeze(-2)\n", " self.nseqs = src.size(0)\n", " \n", " self.trg = None\n", " self.trg_y = None\n", " self.trg_mask = None\n", " self.trg_lengths = None\n", " self.ntokens = None\n", "\n", " if trg is not None:\n", " trg, trg_lengths = trg\n", " self.trg = trg[:, :-1]\n", " self.trg_lengths = trg_lengths\n", " self.trg_y = trg[:, 1:]\n", " self.trg_mask = (self.trg_y != pad_index)\n", " self.ntokens = (self.trg_y != pad_index).data.sum().item()\n", " \n", " if USE_CUDA:\n", " self.src = self.src.cuda()\n", " self.src_mask = self.src_mask.cuda()\n", "\n", " if trg is not None:\n", " self.trg = self.trg.cuda()\n", " self.trg_y = self.trg_y.cuda()\n", " self.trg_mask = self.trg_mask.cuda()\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training Loop\n", "The code below trains the model for 1 epoch (=1 pass through the training data)." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def run_epoch(data_iter, model, loss_compute, print_every=50):\n", " \"\"\"Standard Training and Logging Function\"\"\"\n", "\n", " start = time.time()\n", " total_tokens = 0\n", " total_loss = 0\n", " print_tokens = 0\n", "\n", " for i, batch in enumerate(data_iter, 1):\n", " \n", " out, _, pre_output = model.forward(batch.src, batch.trg,\n", " batch.src_mask, batch.trg_mask,\n", " batch.src_lengths, batch.trg_lengths)\n", " loss = loss_compute(pre_output, batch.trg_y, batch.nseqs)\n", " total_loss += loss\n", " total_tokens += batch.ntokens\n", " print_tokens += batch.ntokens\n", " \n", " if model.training and i % print_every == 0:\n", " elapsed = time.time() - start\n", " print(\"Epoch Step: %d Loss: %f Tokens per Sec: %f\" %\n", " (i, loss / batch.nseqs, print_tokens / elapsed))\n", " start = time.time()\n", " print_tokens = 0\n", "\n", " return math.exp(total_loss / float(total_tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training Data and Batching\n", "\n", "We will use torchtext for batching. This is discussed in more detail below. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimizer\n", "\n", "We will use the [Adam optimizer](https://arxiv.org/abs/1412.6980) with default settings ($\\beta_1=0.9$, $\\beta_2=0.999$ and $\\epsilon=10^{-8}$).\n", "\n", "We will use $0.0003$ as the learning rate here, but for different problems another learning rate may be more appropriate. You will have to tune that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A First Example\n", "\n", "We can begin by trying out a simple copy-task. Given a random set of input symbols from a small vocabulary, the goal is to generate back those same symbols. 
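Before generating data, one note on the metric: `run_epoch` above returns a perplexity, the exponential of the average per-token loss. A quick numeric sketch with made-up numbers (standard library only):

```python
import math

# toy numbers: summed negative log-likelihood over an epoch, and the token count
total_loss = 13.86   # roughly 1.386 nats per token
total_tokens = 10

perplexity = math.exp(total_loss / total_tokens)
print(round(perplexity, 2))  # ~4: as uncertain as a uniform choice over 4 symbols
```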
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Synthetic Data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def data_gen(num_words=11, batch_size=16, num_batches=100, length=10, pad_index=0, sos_index=1):\n", " \"\"\"Generate random data for a src-tgt copy task.\"\"\"\n", " for i in range(num_batches):\n", " data = torch.from_numpy(\n", " np.random.randint(1, num_words, size=(batch_size, length)))\n", " data[:, 0] = sos_index\n", " data = data.cuda() if USE_CUDA else data\n", " src = data[:, 1:]\n", " trg = data\n", " src_lengths = [length-1] * batch_size\n", " trg_lengths = [length] * batch_size\n", " yield Batch((src, src_lengths), (trg, trg_lengths), pad_index=pad_index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loss Computation" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "class SimpleLossCompute:\n", " \"\"\"A simple loss compute and train function.\"\"\"\n", "\n", " def __init__(self, generator, criterion, opt=None):\n", " self.generator = generator\n", " self.criterion = criterion\n", " self.opt = opt\n", "\n", " def __call__(self, x, y, norm):\n", " x = self.generator(x)\n", " loss = self.criterion(x.contiguous().view(-1, x.size(-1)),\n", " y.contiguous().view(-1))\n", " loss = loss / norm\n", "\n", " if self.opt is not None:\n", " loss.backward() \n", " self.opt.step()\n", " self.opt.zero_grad()\n", "\n", " return loss.data.item() * norm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Printing examples\n", "\n", "To monitor progress during training, we will translate a few examples.\n", "\n", "We use greedy decoding for simplicity; that is, at each time step, starting at the first token, we choose the one with that maximum probability, and we never revisit that choice. 
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def greedy_decode(model, src, src_mask, src_lengths, max_len=100, sos_index=1, eos_index=None):\n", " \"\"\"Greedily decode a sentence.\"\"\"\n", "\n", " with torch.no_grad():\n", " encoder_hidden, encoder_final = model.encode(src, src_mask, src_lengths)\n", " prev_y = torch.ones(1, 1).fill_(sos_index).type_as(src)\n", " trg_mask = torch.ones_like(prev_y)\n", "\n", " output = []\n", " attention_scores = []\n", " hidden = None\n", "\n", " for i in range(max_len):\n", " with torch.no_grad():\n", " out, hidden, pre_output = model.decode(\n", " encoder_hidden, encoder_final, src_mask,\n", " prev_y, trg_mask, hidden)\n", "\n", " # we predict from the pre-output layer, which is\n", " # a combination of Decoder state, prev emb, and context\n", " prob = model.generator(pre_output[:, -1])\n", "\n", " _, next_word = torch.max(prob, dim=1)\n", " next_word = next_word.data.item()\n", " output.append(next_word)\n", " prev_y = torch.ones(1, 1).type_as(src).fill_(next_word)\n", " attention_scores.append(model.decoder.attention.alphas.cpu().numpy())\n", " \n", " output = np.array(output)\n", " \n", " # cut off everything starting from \n", " # (only when eos_index provided)\n", " if eos_index is not None:\n", " first_eos = np.where(output==eos_index)[0]\n", " if len(first_eos) > 0:\n", " output = output[:first_eos[0]] \n", " \n", " return output, np.concatenate(attention_scores, axis=1)\n", " \n", "\n", "def lookup_words(x, vocab=None):\n", " if vocab is not None:\n", " x = [vocab.itos[i] for i in x]\n", "\n", " return [str(t) for t in x]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def print_examples(example_iter, model, n=2, max_len=100, \n", " sos_index=1, \n", " src_eos_index=None, \n", " trg_eos_index=None, \n", " src_vocab=None, trg_vocab=None):\n", " \"\"\"Prints N examples. 
Assumes batch size of 1.\"\"\"\n", "\n", " model.eval()\n", " count = 0\n", " print()\n", " \n", " if src_vocab is not None and trg_vocab is not None:\n", " src_eos_index = src_vocab.stoi[EOS_TOKEN]\n", " trg_sos_index = trg_vocab.stoi[SOS_TOKEN]\n", " trg_eos_index = trg_vocab.stoi[EOS_TOKEN]\n", " else:\n", " src_eos_index = None\n", " trg_sos_index = 1\n", " trg_eos_index = None\n", " \n", " for i, batch in enumerate(example_iter):\n", " \n", " src = batch.src.cpu().numpy()[0, :]\n", " trg = batch.trg_y.cpu().numpy()[0, :]\n", "\n", " # remove </s> (if it is there)\n", " src = src[:-1] if src[-1] == src_eos_index else src\n", " trg = trg[:-1] if trg[-1] == trg_eos_index else trg \n", " \n", " result, _ = greedy_decode(\n", " model, batch.src, batch.src_mask, batch.src_lengths,\n", " max_len=max_len, sos_index=trg_sos_index, eos_index=trg_eos_index)\n", " print(\"Example #%d\" % (i+1))\n", " print(\"Src : \", \" \".join(lookup_words(src, vocab=src_vocab)))\n", " print(\"Trg : \", \" \".join(lookup_words(trg, vocab=trg_vocab)))\n", " print(\"Pred: \", \" \".join(lookup_words(result, vocab=trg_vocab)))\n", " print()\n", " \n", " count += 1\n", " if count == n:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the copy task" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": false }, "outputs": [], "source": [ "def train_copy_task():\n", " \"\"\"Train the simple copy task.\"\"\"\n", " num_words = 11\n", " criterion = nn.NLLLoss(reduction=\"sum\", ignore_index=0)\n", " model = make_model(num_words, num_words, emb_size=32, hidden_size=64)\n", " optim = torch.optim.Adam(model.parameters(), lr=0.0003)\n", " eval_data = list(data_gen(num_words=num_words, batch_size=1, num_batches=100))\n", " \n", " dev_perplexities = []\n", " \n", " if USE_CUDA:\n", " model.cuda()\n", "\n", " for epoch in range(10):\n", " \n", " print(\"Epoch %d\" % epoch)\n", "\n", " # train\n", " model.train()\n", " data = 
data_gen(num_words=num_words, batch_size=32, num_batches=100)\n", " run_epoch(data, model,\n", " SimpleLossCompute(model.generator, criterion, optim))\n", "\n", " # evaluate\n", " model.eval()\n", " with torch.no_grad(): \n", " perplexity = run_epoch(eval_data, model,\n", " SimpleLossCompute(model.generator, criterion, None))\n", " print(\"Evaluation perplexity: %f\" % perplexity)\n", " dev_perplexities.append(perplexity)\n", " print_examples(eval_data, model, n=2, max_len=9)\n", " \n", " return dev_perplexities" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.1 and num_layers=1\n", " \"num_layers={}\".format(dropout, num_layers))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 0\n", "Epoch Step: 50 Loss: 19.887581 Tokens per Sec: 7748.957397\n", "Epoch Step: 100 Loss: 17.856726 Tokens per Sec: 7925.338918\n", "Evaluation perplexity: 7.172198\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 8 3 7 5 8 3 7 5 8\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 8 8 8 8 8 8 8\n", "\n", "Epoch 1\n", "Epoch Step: 50 Loss: 15.715487 Tokens per Sec: 8662.903188\n", "Epoch Step: 100 Loss: 12.368280 Tokens per Sec: 7860.172940\n", "Evaluation perplexity: 3.709498\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 4 8 7 5 10 8 7 5 7\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 5 6 2 6 8 2 5\n", "\n", "Epoch 2\n", "Epoch Step: 50 Loss: 9.246480 Tokens per Sec: 7971.095313\n", "Epoch Step: 100 Loss: 7.701921 Tokens per Sec: 7876.198908\n", "Evaluation 
perplexity: 2.303158\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 4 8 7 3 10 5 8 7 5\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 5 6 2 6 8 5 2\n", "\n", "Epoch 3\n", "Epoch Step: 50 Loss: 6.166847 Tokens per Sec: 8069.631171\n", "Epoch Step: 100 Loss: 5.673258 Tokens per Sec: 7855.858586\n", "Evaluation perplexity: 1.775795\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 4 8 7 5 10 3 7 8 5\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 3 6 5 2 8 6 8\n", "\n", "Epoch 4\n", "Epoch Step: 50 Loss: 4.830031 Tokens per Sec: 8094.515152\n", "Epoch Step: 100 Loss: 4.152125 Tokens per Sec: 7999.315744\n", "Evaluation perplexity: 1.572305\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 4 8 5 7 10 3 7 8 5\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 3 6 5 2 8 6 2\n", "\n", "Epoch 5\n", "Epoch Step: 50 Loss: 3.638369 Tokens per Sec: 8112.868501\n", "Epoch Step: 100 Loss: 3.784709 Tokens per Sec: 7843.288141\n", "Evaluation perplexity: 1.433951\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 4 8 7 5 3 10 7 8 7\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 3 6 5 2 8 6 2\n", "\n", "Epoch 6\n", "Epoch Step: 50 Loss: 2.802792 Tokens per Sec: 8128.952327\n", "Epoch Step: 100 Loss: 2.403310 Tokens per Sec: 7893.746819\n", "Evaluation perplexity: 1.284198\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 4 8 5 7 10 3 7 8 5\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 3 6 5 2 8 6 2\n", "\n", "Epoch 7\n", "Epoch Step: 50 Loss: 2.174423 Tokens per Sec: 8181.341663\n", "Epoch Step: 100 Loss: 1.838792 Tokens per Sec: 
7833.160747\n", "Evaluation perplexity: 1.173110\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 4 8 5 7 10 3 7 8 5\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 3 6 5 2 8 6 2\n", "\n", "Epoch 8\n", "Epoch Step: 50 Loss: 1.226522 Tokens per Sec: 8267.548130\n", "Epoch Step: 100 Loss: 1.090876 Tokens per Sec: 7842.856308\n", "Evaluation perplexity: 1.123090\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 4 8 5 7 10 3 7 8 5\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 3 6 5 2 8 6 2\n", "\n", "Epoch 9\n", "Epoch Step: 50 Loss: 1.216270 Tokens per Sec: 8181.132215\n", "Epoch Step: 100 Loss: 0.636999 Tokens per Sec: 7866.309111\n", "Evaluation perplexity: 1.088564\n", "\n", "Example #1\n", "Src : 4 8 5 7 10 3 7 8 5\n", "Trg : 4 8 5 7 10 3 7 8 5\n", "Pred: 4 8 5 7 10 3 7 8 5\n", "\n", "Example #2\n", "Src : 8 8 3 6 5 2 8 6 2\n", "Trg : 8 8 3 6 5 2 8 6 2\n", "Pred: 8 8 3 6 5 2 8 6 2\n", "\n" ] }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYcAAAElCAYAAAAPyi6bAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJzt3Xl8XHW9//HXJ5ksTdqkbdp0L91bFilgla0spYiKG4qIiF5xwQVZVFx+3quIuFwVxAXtlUVEUUQU1KtXBSmUrWwtlNLaBbpDl3RNmqTZP78/zpl0moXMpMmczMz7+XjMI5kzc875zKSd95zv93y/x9wdERGRRHlRFyAiIgOPwkFERDpROIiISCcKBxER6UThICIinSgcRESkE4WDdGJm15qZJ9x2m9njZva2iOpxM/tqP237WjNrSbg/NFx2bH/sL926+Fsm3n4TYV2LzOzBqPYvPYtFXYAMWK3A3PD3UcDVwN/M7Bx3/1d0ZfW524B/JNwfCnwdeBlYHklFfS/xb5loZ7oLkcyhcJBuuftT8d/N7GFgM3AlcFjhYGZF7t54mOX1CXd/BXgl6joORzLvZ+LfUiQZalaSpLh7DbAWmBJfZmZDzOyHZrbFzBrNbI2ZXZq4XrzZxsxmm9kjZlYPfD98zM3sGjP7jpntMLM6M7vXzEb1VI+Zvd7M/mFm1eF695vZ0QmPHxfWdE2H9e4zs1fNrCKxvvD3ScCG8Kl3JjS/nGlmfzazZ7qo46zwOSe/Rq2LzOxBM/uAma01swYze8bM3tjFcy80syVmdsDMdpnZL8xsWMLjk8L9XWpmN5lZFbCjp/erJ2a20cxuM7PPmtnmcP8LzWx6h+cVmdl3w795k5m9ZGZXm5l1eN6YcHvbwr/DOjP7Thf7fZuZLTezejNbamanHe5rkT7i7rrpdsgNuBZo6bAsBmwDHgjvFwCLCT6YLgPOJvjQbwU+1WFbrQTNNJ8H5gEnho85wbf2fwFvBz4G7AIWd9i3A19NuD8HOBCu927gncDjwG5gbMLzvgg0J+zv40Ab8KauXitQFG7PCZqWTgpvZcBbw+XHdqjtd8CLPbyfi8L37mXgonAfS4FqYGTC8z4T1rcAeDPwYeBV4AkgL3zOpLCOrcBvw7rO6+lvGf79Ot4s4Xkbw7/Fc2F9FxEE5QagKOF5vweagP8EzgF+ENbz7YTnVITb2xb+2zgLuAS4rcN7shV4AXg/cC7wLLAPGBr1/wHdXOGgW+dbFx8oY4Gfhx8Cl4bP+Y/wg+wNHda9NfxQyEvYlgMf7WI/8Q+5xA+fd4bLz+nwvMRweAhYBsQSlpURBMv3EpblAQ8DLwHHA7XAjV291oT78Q/fD3Z4Xl74QXlTwrIKoAG4sof3cxEdggWoJAi474T3BxOExU0d1j01XPetHep7JIW/pXdzSwzxjWE9lQnLjg2f94nw/uvC+1/ssI+bw3WHhve/RRDKs3p4TxqAiQnLTgi3/76o/w/o5mpWkm7lE/wHbyb49vpB4Fp3vzV8/M0EzUzPm1ksfgPuB0YD0zps73+72c9f/dD28r8CjQTf2Dsxs0HA6QTfYEnYbz3wJNDeLOHubQQhNiJ8bD3wlZ5femfhtm4FPhjWQLhtB+5MYhNr3b29g9vdq4DHOPg6TyYIuN91eD+fBvYnvq7QX1MovxV4Qxe3ezs879GwrniNywmCNV7j6eHPuzqs9zugGIg3k50NPO7uq3uoa6W7b068H/6c2MN6kgbqkJbutBJ8KDiwF9js7i0Jj1cCMwnCoysVCb+3ufuubp5XlXjH3d3MdgJjunn+cILg+k5462hth+1tMbNHgHcBN/vhdYT/guCb+HsJAuHjwL3uvjeJdau6WLaDoIkMgvcTgiakrlR0uJ9SP4O7L0niad3VGP9bDEtYlmh7h8crCEKtJ4e8b+7eGHZdFCexrvQzhYN0q4cPlD3AGoIjiq4kfmt8rXnhKxPvhB2bIwm
aprqyj6A560bCo4cOGjps7wKCYHgO+IaZ3efu3W37Nbn7DjP7M3Cpma0HjiJoU09GZRfLRnHwde4Jf15E0DfRUcfTTvtjrv3uanwp/H1vwrJXE54zOvwZfw27gHF9Xp2klZqVpLfuB44Adrv7ki5utUlu5x1mVpR4n6BjuMtTL929juDb9eu62e+K+HPNbBxBe/gvCTrCa4FfdjyzpoP4kUV3315vJmji+S5BU9EjPb9EAGYkDqwzs8pwO/HX+URY3+RuXtemJPdzOE4P64rXeCwwPaHGR8Of7++w3oUEofxseP9BYK6ZzejHWqWf6chBeutO4KPAw2Z2A0F7cSkwCzjJ3d+b5HbaCAbX/ZjgG+n3gCfd/YHXWOfzwCNm9jfgVwTNIaMIOm9fcvefhgHwK4Jvu1e6e62ZfYigI/QK4CfdbHsHwTfgi8xsDUFH6xp33x8+/hDBN+m5wJeSfI0QNL3cZ2ZfC7f5NYKzfn4IwanCZvZl4EdmNgZ4gKAfZSLBWUE/dffFKezvEGbWVR/OfndfmXB/N/BPM/smQUB/h6Cj+o6wxhfN7B7gO2ZWCCwJa/skQcf6vnA7PwQ+BCwys+sIjjDHA6e5+yd6+xokvRQO0ivu3mxm5wD/BXyW4ENsH8EHwd0pbOoWgm/ptxOcsfNPemiqcfcl4YfdtQRnUZUSfPg+RXB6J8DngDMJPpBqw/UeM7PvAd8zswfd/d9dbLvNzC4h+GB8IKxtHkGoxPtE/hRu/1cpvM6VBH0W1xG8Vy8QnJHV3lzk7gvM7BWCU3A/Gi7eQvBNfAO9l0/QId/R0xza8f9PYAVBcI4kOJr5tLs3JTznP8LXcBlBIG8K670x4XXsMbNTCN7D6wg62l8htX8XEjFz12VCJRpm5sDX3P1bUdeSCjN7EVjt7hck+fxFBKfLnt2vhR0GM9sIPOjuH4+6FhkYdOQgkoSwX+QEgsF6xwBqHpGspnAQSc4YghHhe4AvuXtXzTQiWUPNSiIi0olOZRURkU4UDiIi0knG9jmMGDHCJ02aFHUZIiIZZenSpbvcfWRPz8vYcJg0aRJLliQzXYyIiMSZWVKj7dWsJCIinSgcRESkE4WDiIh0onAQEZFOFA4iItKJwkFERDrJqXBwdx5Zu5Nr/rKCF7bs63kFEZEclbHjHHrDzLjurytZt7OOQQX5zJ4wNOqSREQGpJw6cgCYf+QoAB5cldL12UVEckruhcOs4BK563bWsXFXXcTViIgMTGkLBzNbaWa1CbcDZuZmdkK6agB4/RHDKB9UAOjoQUSkO2kLB3c/2t0Hx28E15z9t7s/l64aAGL5ecybGcw5tXBVVTp3LSKSMSJpVjKzGMEF1G9Ocb0KM5thZjNaWlp6vf94v8OzG/dQfaC519sREclWUfU5nAeUA79Ocb0rgDXAmqqq3n/rP2PmSGJ5RktbcGqriIgcKqpw+CTwe3dPdbDBTcBMYGZlZWWvd15WXMAbJw8HYKH6HUREOkl7OJjZVGA+8PNU13X33e6+1t3XxmKHN0Qj3rS0aM1OWlrbDmtbIiLZJoojh08CL7j70xHsu93ZRwZHHtUHmlm6aW+UpYiIDDhpDQczKwQuoRdHDX3tiIpSplUOBmDhap21JCKSKN1HDu8BioHfpnm/XZofHj1ovIOIyKHSGg7ufre7l7l7bTr3252zw36H9Tvr2KDR0iIi7XJu+oxEJ0wcxrCSYLS0zloSETkop8MhP8+YN1NNSyIiHeV0OEDiaOm9VNdrtLSICCgcOH3GCAryjdY2Z9FanbUkIgIKB4YUF3Di5ApAE/GJiMTlfDjAwVNaF62polmjpUVEFA4A82cF/Q41DS0s2ajR0iIiCgdgYkUJ0+OjpXXWkoiIwiEuftaSptIQEVE4tItPxLdhVx3rdg6IAdwiIpFROISOnziM4aWFgJqWREQUDqH8POPM8NrSD+qUVhHJcQqHBPGJ+JZu2su++qaIqxERiY7CIcFp0w+Olta1pUU
klykcEgwpLuCkKcFoaTUtiUguUzh0MH+WRkuLiCgcOoiPd9jf0MKzG/dEXI2ISDQUDh1MGF7CzFFDAE3EJyK5S+HQhfhEfAtX7cDdI65GRCT9FA5diDctbdxdz7qdura0iOQehUMXjpswlAqNlhaRHKZw6EJ+njFvVrxpSf0OIpJ70h4OZna2mT1lZrVmtsvMFqS7hmTEJ+JbsmkPe+s0WlpEcktaw8HMzgT+CNwAVADjgdvSWUOy5k4fSWF+Hm2Ori0tIjkn3UcO/w383N3/6O6N7t7g7s+luYakDC6KceKU4YBGS4tI7klbOJhZKfBGIGZmz4VNSovMbE4K26gwsxlmNqOlpaX/ig3FJ+J7dM1Omlo0WlpEckc6jxyGhfu7CLgEGAs8APzdzIYmuY0rgDXAmqqq/v82Hx/vsL9Ro6VFJLekMxz2hz9/6e7L3b2JoJmpADglyW3cBMwEZlZWVvZDiYcaP6yEWaOD0dIP6pRWEckhaQsHd68GNgIdhxx7F8u628Zud1/r7mtjsVgfV9i1g6OlqzRaWkRyRro7pBcAHzGzo8wsBnwRaAQWp7mOpMVHS2/eU8/LVbq2tIjkhvR8/T7oBmAI8BBQDDwPvDU8qhiQjhs/lBGDC9lV28TC1VVMDyflExHJZmk9cvDANe4+2t2Huvs8d1+WzhpSlZdnzJt5cCI+EZFcoOkzkjA/4drSGi0tIrlA4ZCE06aPaB8t/fAaDYgTkeyncEhCaVGMk6cG15bWRHwikgsUDkmKT8T3yFqNlhaR7KdwSNJZYb9DbWMLz2zQaGkRyW4KhySNGzqII8eUARotLSLZT+GQgnjT0sLVura0iGQ3hUMK4qe0btlzgJc0WlpEspjCIQXHjitn5JAiQE1LIpLdFA4pyMszzpqpa0uLSPZTOKTorLDf4bnNe9ld2xhxNSIi/UPhkKLTpo+gMJaHOzy8ZmfU5YiI9AuFQ4pKCmOc0j5aWv0OIpKdFA69ED9r6dG1O2lsaY24GhGRvqdw6IX5s4J+h7qmVp5er9HSIpJ9FA69MHboII4KR0s/tFpnLYlI9lE49FJ8tPSDqzRaWkSyj8Khl+L9Dq/sPcDaHRotLSLZReHQS6/TaGkRyWIKh17Ky7P2jmmd0ioi2UbhcBjiTUvPb9nHLo2WFpEsonA4DHOnjaAoPlpaZy2JSBZROByGQYX5nDptBKCJ+EQku6QUDmb2rJldamaDU92Rmd1hZs1mVptwuyzV7Qw088NTWh97SaOlRSR7pHrk8DBwHbDNzH5hZiemuP6v3H1wwm1BiusPOPNnBf0OdU2tPKXR0iKSJVIKB3f/EjAB+DAwGnjCzF40syvNbFh/FJjIzCrMbIaZzWhpaenv3SVldHkxx4wLRkvrrCURyRYp9zm4e4u73+fubwOOAO4Dvge8ama/NbM3vMbq55vZHjNba2bX96J56gpgDbCmqmrgtPHHjx4WrqrSaGkRyQq97pA2s6nA5cAngAPAbUAxwdHENV2schMwCxgBvBs4A7g1xd3eBMwEZlZWVvay8r53dnhK66v7DrB6+/6IqxEROXypdkgXmdnFZvYwwTf404AvA2Pd/Up3Px84D7i647ruvtTdd7h7m7uvBD4HvNfMipLdv7vvdve17r42FoulUnq/OnpsGZXhaGk1LYlINkj1yGE7wbf3F4HZ7j7X3X/t7g0Jz1kMJNMz2xb+tBRrGHDy8qz9rKUHdUqriGSBVMPhc8C48ChhZVdPcPd97j6543Ize7+ZDQ1/nw78APjfDsGSseL9Di+8so+d+zVaWkQyW6rhcDrQqT3HzErN7PYe1v0UsN7M6oAHgKeAj6S4/wHrVI2WFpEskmo4fBgY1MXyQcB/vNaK7n6muw9391J3n+zun3f3mhT3P2ANKsxnbny09Gr1O4hIZks1HAw45FxNMzNgLrCzr4rKVPGJ+B57aRcNzRotLSKZK6lwMLM2M2slCIbtZtYavwEtwL3Ab/qxzowQ75S
ub2rlqfW7I65GRKT3kj0f9CKCo4a7CPoOqhMeawI2uPuyPq4t44wqK+Z148p58dVqFq6q4syZA2cshohIKpIKB3f/PYCZbQOecPeBMXfFADT/yMowHHZw3buOJmh1ExHJLD02K5lZ4tffVcBwM6vs6tZ/ZWaO+GjprdUNrNqm0dIikpmSOXLYZmZj3L2KYBBcV5MHxTuq8/uyuEx09NgyRpcVs72mgYWrdnDU2LKoSxIRSVky4XAWB0c8n0XX4SAhM+OsIyu56+nNPLi6iivmT4+6JBGRlPUYDu7+SMLvi/q1mixxdhgOL2zZR9X+BiqHFEddkohISlKdeO8L3SwvNrOb+6akzHfK1BEUFwRvrUZLi0gmSnUQ3FfM7B9mNjK+wMyOBZ4DzuzLwjJZcUE+c6cFb5Em4hORTJRqOBwPDAaWm9mbzexK4GngGeCEvi4uk50dDoh7XKOlRSQDpXqZ0M0EF+n5A/B34AbgY+5+ibvX9UN9GeusWUE4HGhu5cl1Gi0tIpmlN1eCmwdcQNCUVAdcbGYj+rSqLFBZVszs8eUAPKgLAIlIhkm1Q/q7wD8ILu95EnAcMBR40czO6fvyMttZ4TUeHlqta0uLSGZJ9cjhYuBN7n6Nu7e6+yaCS4XeCvy1z6vLcPGJ+LZVN7Bya9bMTi4iOSDVcJidOO4BILwm9DXA2X1XVnY4emwZY8qDMQ4LddaSiGSQVDuk9wCYWYWZnWhmRQmPPdbXxWU6M2vvmH5IFwASkQySap/DYDP7HcGFfRYD48LlN5vZ1/uhvowXn4jvhVeqqarJistli0gOSLVZ6b+BKcCJwIGE5X8D3t1XRWWTk6dWMKggmI/wIY2WFpEMkWo4vBO4yt2f5dAJ+FYRhIZ0UFyQz9zpwZm+Gi0tIpki1XAYCXTVeD6IYNpu6UL7aOmXd2q0tIhkhFTDYTldn5V0MfDs4ZeTneaFndINzW0sXrcr4mpERHqWajhcC/zQzL5BcGGfi8zsN8DnwseSYmZ5ZrbYzNzMxqdYQ8apHFLM7AlDATUtiUhmSPVU1n8C5xHMr9QG/BdwBPBWd380hU19DqhPZd+Z7uz4Ka2rNFpaRAa+lOdWcvcH3f1Mdx/s7iXufpq7P5Ts+mY2A7gM6PLaENlqfnhK6/YajZYWkYGvNxPv9ZqZ5QG3EwTDvl6sX2FmM8xsRktLS5/X15+OHDOEseFoaU3EJyIDXY/hYGYHzKw+mVsS+7sK2O7uf+plvVcAa4A1VVWZ1XZvZu1HD5pKQ0QGuh6vIQ18mkPHNPSKmU0DrgbmHMZmbgLuAqisrFxzuDWl2/wjK7nzqU28+Go1O2oaGFWma0uLyMDUYzi4+x19tK+5BOMkVpgZHDxqWW5mX3X3BUnUshvYDTBnzuFkTDROmlJBSWE+9U2tLFxVxQdOnBh1SSIiXepVn4OZnWJmHw9vpyS52j3AVIJrQBwHnBsuPwf4dW/qyDTFBfmcFo6WXqh+BxEZwJJpVmpnZhMIPuRP5GCH8lAzewa4wN23dLeuu9eTcPqqmcX3vd3da1OqOoPNP3IU96/cweMv7+JAUyuDCvOjLklEpJNUjxxuJQiUo919uLsPB44mmDrj1lQ25O4b3d3c/ZUUa8ho82ZWYgaNLW088bJGS4vIwJRqOJwBfNrdV8UXhL9fDpzel4Vlq5FDipg9PhgtvVDXeBCRASrVcNgGdDXAoBXQ+ZlJik/Et3BVFW1tGi0tIgNPquFwDfAjM2s/zSb8/Qbga31ZWDaLj3eo2t/Ic5v3RlyNiEhnqYbDfxGMU1hvZq+Y2SvAeuCNwFfM7N/xW18Xmk1mjR7ClJGlAHz53uXUN2XWaG8RyX4pna0E3N0vVeQYM+P75x/Lhbc8xbqddVzzl5XccMHsqMsSEWmXdDiYWT7wMLDc3VOeF0kONWfScD7/phlcf/8a/rj0FU6ZWsF7Tsj62ctFJEMk3azk7q3Av4Bh/VdObvn0GVPbB8V99c8rWLczZ4Z
7iMgAl2qfwypAX2/7SF6eceP7jmPkkCLqm1q5/K7ndRlRERkQUg2HLwDXm9kbw2YmOUwjhxTxowuPwwxWbavh2/+3queVRET6Warh8FeCs5WeBBp6MWW3dOHUaSO4fN40AO58ahP/eHFbxBWJSK5L9WylT/VLFcJV86fz9Po9PLNxD1+6dznHjCtnwvCSqMsSkRxlmXo94zlz5viSJUuiLqNPbas+wLk/foy99c3MnjCUP3zyZApjab1Yn4hkOTNb6u49XvMg5U8eM6s0s6vN7H/MbES47FQzm9ybQuWgMeWD2sc7vLBlHzc8kHHXMxKRLJFSOJjZ8cBq4CPAx4Cy8KE3Ad/q29Jy0/wjR/HxuUHO3vLoeh5erSmrRCT9Uj1y+AFwi7sfAzQmLL8fOLXPqspxX3rLLGaPLwfg8/csY3t1Q8QViUiuSTUcTgBu62L5VmDU4ZcjAIWxPG666ASGFMXYW9/MVXc/T6tmbxWRNEo1HFqA0i6WTwX2HH45EjexooTvnn8sAE9v2MNPFr4UcUUikktSDYd/Al80Mwvvu5kNA64jGAMhfehtx47h4hOD2dF/8tBLLF6nK8eJSHr0ZoT064F1QDFwL7ABGAr8Z9+WJgBfe/tRzBo9BHf47N3L2FXb2PNKIiKHKdVw2Au8geBI4WbgKeBqYI67q1mpHxQX5PPTD5zAoIJ8qvY38vl7XtDV40Sk3yUVDmY23Mz+CtQC1cClwA3ufpm7/8LddTpNP5pWOZhvnncMAI+u3cktj62PuCIRyXbJHjl8GzgR+DrwRYIzk37eX0VJZ+99/Xjec/w4AK6/fw1LN+nyoiLSf5INh7cCH3P377j7jcA7gbPNLNW5meQwfPO8Y5gyopTWNufK3z1PdX1z1CWJSJZKNhzGAUvjd9z930ATMDaVnZnZt81sg5nVmFmVmf3RzCamso1cVloU46cfOIHCWB6v7jvAF//4Apk6N5aIDGzJhkM+0PFramu4PBV3Ase5exkwCdiMrkudkqPGlvG1tx8FwAP/3sGvn9wUcUUiko1SaRb6nZk1JdwvBn6ZeB0Hdz/3tTbg7qsT7hrQBsxMtgAzqwAqAGbPnp3salnngydOZPHLu/jHiu18+/9W8fojhnHMuPKoyxKRLJLskcOvgC3AjoTbbwjGOCQu65GZfcDMqgnOfLoKuDaFeq8A1gBrqqpyd0I6M+O75x/L+GGDaGpt4/K7nqO2sSXqskQki0R2PQczG00ws+sT7r4oyXUSjxzWLFu2rP8KzADLtuzjvf+zmJY2513HjQ0vN2o9rygiOavfrufQV9x9O3Ar8DczG57kOrvdfa27r43FdKLUcROG8uW3zALgL8u28oclr0RckYhki6gvMxYjmMgvpbOe5KCPzZ3MvJkjAbjmf1ewdsf+iCsSkWyQtnAwszwzu9zMKsP744GfARsJLiAkvZCXZ/zgfccxqqyIhuag/+FAU2vUZYlIhkv3kcO5wAozqwOeBuqBs91dvamHYXhpIT9+//HkGazdUcs3/roy6pJEJMOlLRzcvc3dz3X3Sncvdfdx7n6xu69LVw3Z7KQpFVw1fwYAdz+7hb8sezXiikQkk0Xd5yB96PKzpnHylAoA/vO+F9m4qy7iikQkUykcskh+nvHj9x9HRWkhdU2tXP6752hsUf+DiKRO4ZBlKsuKufHC4wBY8WoN3/2H+vpFJHUKhyx0xoyRfOqMqQD88omNPLBye8QViUimUThkqavPmcEJE4cC8MU/LufVfQcirkhEMonCIUsV5Ofxk4uOp6w4RvWBZq783fM0t7ZFXZaIZAiFQxYbP6yE6y8IZq9dumkvP/zX2ogrEpFMoXDIcm8+ejSXnDIJgAWL1vHo2p3RFiQiGUHhkAO+cu4sjh5bBsDn71lGVU1DxBWJyECncMgBRbF8fvqBEygtzGdXbROf/f0yWtt0eVER6Z7CIUdMHlHKd97zOgAWr9vNgodfjrgiERnIFA455F3HjePCORM
A+OGDa3l6/e6IKxKRgUrhkGOufefRTK8cTJvDVXcvY09dU88riUjOUTjkmEGF+fzs4hMoLshje00DX/jDC0R1qVgRGbgUDjloxqghXPuOowF4aHUVtz22IeKKRGSgUTjkqAvfMIF3zg6uzvrtv6/ik3cuYdW2moirEpGBQuGQo8yMb7/7GI4ZF4x/uH/lDt7648e47LdLWbNd16EWyXUKhxw2pLiAP192Kje+bzZHVJQA8PcXt/OWHz/K5Xc9x8tVCgmRXGWZ2hk5Z84cX7JkSdRlZI2W1jbue/5VfrLwJV7ZG8zgagbvmj2WK+dPZ8rIwRFXKCJ9wcyWuvucHp+ncJBEza1t3Lv0FW566OX2ab7zDM47fhxXnjWdSSNKI65QRA6HwkEOS1NLG/cs2cLPHn6ZbdXBXEz5ecb5J4zjirOmM2F4ScQVikhvKBykTzS2tPL7Z4OQ2FHTCEAsz7hgzng+M28a44cpJEQyyYALBzP7HvB2YAJQC/wf8GV339Ob7Skc0quhuZW7nt7MgkXr2FUbhERBvnHhGybwmXnTGFM+KOIKRSQZyYZDOs9WagU+CFQAs4HxwB1p3L8chuKCfD46dzKPfWkeX33bkVSUFtLc6vzmqc2c8f1FfP0vK9ihqcBFskZkzUpm9hbgHncv6836OnKIVn1TC3c+uYmfP7KOvfXNABTG8rj4xIl8+sypVA4pjrhCEenKgGtW6rRjs+uBk9z9tBTWqSA48mD27Nlrli1b1l/lSZJqG1v41eKN3PLoeqoPBCFRXJDHh046gk+eMZURg4sirlBEEg3ocDCz8wmalM5w9+dSWO9a4OsAY8aMYevWrf1Sn6Ruf0MzdzyxkVsfW09NQwsAgwry+fApk/jE6VMYXloYcYUiAgM4HMzsAuBm4Hx3fzjFdXXkMMBVH2jm9sc3cPvjG9jfGIREaWE+l5w6iUtPm8LQEoWESJQGZDiY2UeAHwDvcPcnDmdb6nMY2Krrm7nt8fXc/vgG6ppaARhcFOOjcyfzsbmTKR9UEHGFIrlpwIWDmV1J0CT0Fnd/9nC3p3DIDHvrmrj1sfXcsXgj9WFIDCmO8fG5U/jI3EmUFSskRNJpIIaDAy1AY+Jyd+++V0jCAAANN0lEQVTVpD0Kh8yyu7aRWx5dz6+f3MSB5iAkygcVcOlpk7nk1MkMLopFXKFIbhhw4dDXFA6Zaef+Rm5+ZB13PrWJxpY2AIaVFPCJ06dy8UkTdSQh0s8UDjKgVdU0sGDROu56ZjNNYUgATBlRytHjyjlmbBnHjCvn6LFl6sQW6UMKB8kI26sbWLDoZe5+ZgtNrW1dPmf8sEEcM7acY8aVhcFRzsghGj8h0hsKB8ko1fXNLH91HyterWHF1mpWvFrNpt313T5/VFkRx4wtP+QoY0x5MWaWxqpFMo/CQTJe9YFm/r21hpVhWKzYWsO6nbV09092eGkhR4dBET/SmDi8RIEhkkDhIFmprrGF1dtrgiOMMDBe2rGflrau/x0PKY61B0XQh1HO5BGl5OcpMCQ3KRwkZzQ0t7J2x/5DmqRWb9vfbR9GSWE+R4052OF9zLhyplUOpiBfl1SX7JdsOOjkcsl4xQX5HDt+KMeOH9q+rLm1jZd21LJiazUrwyOMf2+t4UBzK/VNrSzZtJclm/a2P78wlseRo4dw9Lhypo4czMThJUwcXsKE4YMoKdR/E8k9OnKQnNHa5mzYVZvQJFXNyldr2ueA6s6IwUVMHD4oITCCnxMrShg1pJg8NVFJBlGzkkgS2tqcLXvr25ukVm6tYfPuOl7Ze6DbfoxEhfl5jE8IjsTwmDC8RCO/ZcBRs5JIEvLyjCMqSjmiopS3HTumfXlLaxvbaxrYvKeeLXvq2bynns17DrTf31PXBEBTaxvrd9axfmddl9uvKC08eKSRGB4VJYwuK1bHuAxYCgeRLsTy8xg/rITxw0pgaufH9zc0syUhLDaHty176tmyt57m1uCoY3ddE7vrmli2ZV+
nbRTm5zF+2KBDwmNCQl/HEE0lIhFSOIj0wpDiAo4aW8BRYztf5ba1zdkRHnV0FR67ahOOOnbVsX5X10cdIwYXMik8qpk8ooRJI0qZVFHKpBGlaq6Sfqd/YSJ9LD/PGDt0EGOHDuKkKRWdHq9rbGHL3no27+4iPPYeaJ9raldtE7tqmw45qypu5JAiJlWUtIfF5PbgKNHZVdIn9K9IJM1Ki2LMGl3GrNGdjzra2pyq/Y1s3F3Hpt11bNhVz8ZddWzcHdwamoPg2Lm/kZ37G3l2Y+fgqBxSFARGe3AERx1HDC9lUGF+v78+yQ4KB5EBJC/PGF1ezOjy4k5HHfHg2BAPi1117b9v2l3fPgV61f5GqvY38syGPZ22P7qsmEkjShKONErDpqsSigsUHHKQwkEkQyQGx8lTOwfH9pqGIDDag6Oejbvr2Ly7vn20+PaaBrbXNPDU+kODwwzGlBUHYTGi9JAmq9HlxQwpimmOqhyjcQ4iWa61zdlWfYCNu+rbgyMeIlv2HDyz6rWUFOYHwVQW3sKQGlVWzJhwecXgIp2amwE0zkFEgKCDPH5a7tzpIw55rKW1jW3VDe3NUxvC4Ni4O+gojw8ErG9qfc3xHPH9VA4pag+R9uBICJFRZcVqvsoQCgeRHBbLz2NCOL7idEYe8lg8OHbUNLT/3F7dwLaaBnZUB81TO2oa2o88giOU4LmvZWhJwcGjj4QQGRXeH1NeTPmgAjVjRUzhICJdSgyO7rS1OXvqm9heHQRHPDASw2R7TQP7Gw7OX7Wvvpl99c2s3r6/2+0WxfIONmOVF1NRWsTgonxKi2KUFMUoLQx+Ly2MUVKUz+CiGCWF+e33i2I6OjlcCgcR6bW8PGPE4CJGDC7imHHl3T6vrrElCI7wyKKrENlZ29h+IafGljY27a5/zasBvpaCfKOk8GCIHBoo+ZQUxQ4JlNKiGKVF+cE6RfFl8fvBOrEcm9Jd4SAi/a60KMbUkYOZOnJwt89pbm1j5/7GQ0JkR3h21Z66JuoaW6hvaqWuqYW6xlbqGlvaT9/tvC2n+kAz1Qea++w1FMbyKCnMpyiWR3FB1z+Lulme6s/4dopieZE1rykcRGRAKMjPax9ZnqyW1jbqmlqpTwiMuqYW6hsPhkh9Uwu18WAJfwb3D65zMHRauj17q6mlrX30ejp1FR6jyor5zcdP7Nf9pjUczOz9wGeA2UCJuyucRKTXYvl5lA/Ko3xQ301S2NTS1ilQ6hqDQGlsaaWxuY2G+M/mVhpbkvvZ1MXyZKaFb2xp63SEVNvDNUj6Qro/nPcCC4BBwC1p3reISI8KY3kUxgoZWlLY7/tqaW1LOlwaW9poDH8Wxfq//yOt4eDu9wOY2Zm9Wd/MKoAKgNmzZ/ddYSIiEYjl5xHLz6N0AM6ym2nd71cAa4A1VVVVUdciIpK1Mi0cbgJmAjMrKyujrkVEJGtlVDi4+253X+vua2OxgXcYJiKSLTIqHEREJD3SfSprPlAAFIb3i8OHGj1Tp4cVEclC6T5y+BBwALgfyA9/PwAckeY6RETkNaQ1HNz9Dne3Lm4b01mHiIi8toy92I+Z7QQ29WLVfGAUsANo7dOiMpPej0Pp/ThI78WhsuX9OMLdR/b0pIwNh94ysxkEYyVmuvvaqOuJmt6PQ+n9OEjvxaFy7f3Q2UoiItKJwkFERDrJxXDYDXwj/Cl6PzrS+3GQ3otD5dT7kXN9DiIi0rNcPHIQEZEeKBxERKQThYOIiHSicBARkU4UDiIi0onCQUREOlE4iIhIJwoHERHpJKfCwczyzex6M9tpZvvN7F4zGxF1XVEws++Z2UozqzGzrWZ2q5kNj7quqJlZnpktNjM3s/FR1xMlMzvbzJ4ys1oz22VmC6KuKSpmNtrMfh9+duw1s4fMbHbUdfWnnAoH4P8B7wJOBOL/8e+MrpxItQIfBCqA2QTvxx1RFjRAfA6oj7qIqJnZmcAfgRsI/o2MB26LsqaILQCGAzMIpu1eAvz
NzCzSqvpRTk2fYWabgOvc/Rfh/anAy8Akd+/NtSGyhpm9BbjH3cuiriUq4ZTM/wDOB54HJrj7K9FWFQ0zexJ4xN3/X9S1DARmthz4qbvfEt6fCawGRrr7rkiL6yc5c+RgZkOBicDS+DJ3XwfUEHxzznXzgReiLiIqZpYH3A58AdgXcTmRMrNS4I1AzMyeC5uUFpnZnKhri9D1wPlmNtLMioFPAI9nazBADoUDMCT8Wd1h+T4gZ78tA5jZ+cCngKuiriVCVwHb3f1PURcyAAwj+Gy4CLgEGAs8APw9/JKVi54guBJcFVALvAe4NNKK+lkuhcP+8Gd5h+VDCY4ecpKZXQDcCrzT3Z+Lup4omNk04Grg8qhrGSDi/1d+6e7L3b0J+G+gADglurKiER5VPgisJfj8KAG+DTxmZqOirK0/5Uw4uPs+YDNwQnyZmU0hOGpYHlVdUTKzjwA3A+9w94ejridCc4GRwAoz2wXEQ3K5mV0WXVnRcPdqYCPQsUPSu1iWC4YDk4Gb3L3G3Zvc/TaCz8+Toy2t/+RMOIRuAb5sZpPNrAz4HnC/u2+Mtqz0M7MrCc5EebO7PxF1PRG7B5gKHBfezg2XnwP8OqqiIrYA+IiZHWVmMeCLQCOwONqy0i/sV1gLXGZmpWYWM7OPEjRVZ+0Xy1jUBaTZdwnaU58FioB/EZzOmYt+DLQADyeejefugyOrKCLuXk/C6avhhyEEfRC10VQVuRsIPvweAooJzt56a3hUkYvOI+iU3kTQvPYycIG7r4+0qn6UU6eyiohIcnKtWUlERJKgcBARkU4UDiIi0onCQUREOlE4iIhIJwoHERHpROEgMgCY2SVm1hB1HSJxCgfJeWZ2R3hxn463nJyuWwRyb4S0SHceBj7QYVlrFIWIDAQ6chAJNLn79g63nQBmttHMrjOz28PLqu40s28mXgXMzMrN7BfhtQ8azOwJMztkUjYzmx5emnavmdWb2fNmNq/Dc04zs2Xh48+Y2fHpefkih1I4iCTnswSz+s4huCDQ1cCnEx7/JXAGcCHwemAdcH98SmczG0NwTYASgon9jgW+1WEfBeGyz4Tb2AfcHU4ZLZJWmltJcp6Z3UEwAWPHDuE/ufuHzGwjsMHd5yWs833gPe4+zcymE8za+SZ3fzB8PD4526/d/Wtm9i3gI8A0dz/QRQ2XEATMbHdfHi47FXgcXcZWIqA+B5HAYuCjHZYlzsj6ZIfHngC+EF4y8kiC6xw8Hn/Q3ZvD6zAfFS46geCykp2CIUELsCLh/tbw5yiC2UBF0kbhIBKod/eXI66h1d3bEu7HD+vVrCRpp390Isk5qcP9UwiamhqAfwNGcEU5oL1Z6WRgZbjoOeBUMxuUhlpFDpvCQSRQaGajO94SHp9jZl8zsxlm9iGC603/ECA84rgP+LmZnWVmRwG/ILiw1M/C9RcQXDTnPjM72cymmNm7Op6tJDJQqFlJJDAP2NZxYXgEAPAjYBqwFGgiuJLegoSnfhS4EfgDUBo+783uvgPA3bea2Vzg+8D9QD6wmuCsJ5EBR2crifQgPFvp5+7+3ahrEUkXNSuJiEgnCgcREelEzUoiItKJjhxERKQThYOIiHSicBARkU4UDiIi0onCQUREOlE4iIhIJ/8fDb9RGSoVbuYAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# train the copy task\n", "dev_perplexities = train_copy_task()\n", "\n", "def plot_perplexity(perplexities):\n", " \"\"\"plot perplexities\"\"\"\n", " plt.title(\"Perplexity per Epoch\")\n", " plt.xlabel(\"Epoch\")\n", " plt.ylabel(\"Perplexity\")\n", " plt.plot(perplexities)\n", " \n", "plot_perplexity(dev_perplexities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that the model managed to correctly 'translate' the two examples in the end.\n", "\n", "Moreover, the perplexity of the development data nicely went down towards 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A Real World Example\n", "\n", "Now we consider a real-world example using the IWSLT German-English Translation task. \n", "This task is much smaller than usual, but it illustrates the whole system. \n", "\n", "The cell below installs torch text and spacy. This might take a while." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "#!pip install git+git://github.com/pytorch/text spacy \n", "#!python -m spacy download en\n", "#!python -m spacy download de" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Loading\n", "\n", "We will load the dataset using torchtext and spacy for tokenization.\n", "\n", "This cell might take a while to run the first time, as it will download and tokenize the IWSLT data.\n", "\n", "For speed we only include short sentences, and we include a word in the vocabulary only if it occurs at least 5 times. In this case we also lowercase the data.\n", "\n", "If you have **issues** with torch text in the cell below (e.g. an `ascii` error), try running `export LC_ALL=\"en_US.UTF-8\"` before you start `jupyter notebook`." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# For data loading.\n", "from torchtext import data, datasets\n", "\n", "if True:\n", " import spacy\n", " spacy_de = spacy.load('de')\n", " spacy_en = spacy.load('en')\n", "\n", " def tokenize_de(text):\n", " return [tok.text for tok in spacy_de.tokenizer(text)]\n", "\n", " def tokenize_en(text):\n", " return [tok.text for tok in spacy_en.tokenizer(text)]\n", "\n", " UNK_TOKEN = \"\"\n", " PAD_TOKEN = \"\" \n", " SOS_TOKEN = \"\"\n", " EOS_TOKEN = \"\"\n", " LOWER = True\n", " \n", " # we include lengths to provide to the RNNs\n", " SRC = data.Field(tokenize=tokenize_de, \n", " batch_first=True, lower=LOWER, include_lengths=True,\n", " unk_token=UNK_TOKEN, pad_token=PAD_TOKEN, init_token=None, eos_token=EOS_TOKEN)\n", " TRG = data.Field(tokenize=tokenize_en, \n", " batch_first=True, lower=LOWER, include_lengths=True,\n", " unk_token=UNK_TOKEN, pad_token=PAD_TOKEN, init_token=SOS_TOKEN, eos_token=EOS_TOKEN)\n", "\n", " MAX_LEN = 25 # NOTE: we filter out a lot of sentences for speed\n", " train_data, valid_data, test_data = datasets.IWSLT.splits(\n", " exts=('.de', '.en'), fields=(SRC, TRG), \n", " filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and \n", " len(vars(x)['trg']) <= MAX_LEN)\n", " MIN_FREQ = 5 # NOTE: we limit the vocabulary to frequent words for speed\n", " SRC.build_vocab(train_data.src, min_freq=MIN_FREQ)\n", " TRG.build_vocab(train_data.trg, min_freq=MIN_FREQ)\n", " \n", " PAD_INDEX = TRG.vocab.stoi[PAD_TOKEN]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's look at the data\n", "\n", "It never hurts to look at your data and some statistics." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data set sizes (number of sentence pairs):\n", "train 143116\n", "valid 690\n", "test 963 \n", "\n", "First training example:\n", "src: david gallo : das ist bill lange . ich bin dave gallo .\n", "trg: david gallo : this is bill lange . i 'm dave gallo . \n", "\n", "Most common words (src):\n", " . 138325\n", " , 105944\n", " und 41839\n", " die 40809\n", " das 33324\n", " sie 33035\n", " ich 31153\n", " ist 31035\n", " es 27449\n", " wir 25817 \n", "\n", "Most common words (trg):\n", " . 137259\n", " , 91619\n", " the 73344\n", " and 50273\n", " to 42798\n", " a 39573\n", " of 39496\n", " i 33524\n", " it 32921\n", " that 32643 \n", "\n", "First 10 words (src):\n", "00 \n", "01 \n", "02 \n", "03 .\n", "04 ,\n", "05 und\n", "06 die\n", "07 das\n", "08 sie\n", "09 ich \n", "\n", "First 10 words (trg):\n", "00 \n", "01 \n", "02 \n", "03 \n", "04 .\n", "05 ,\n", "06 the\n", "07 and\n", "08 to\n", "09 a \n", "\n", "Number of German words (types): 15761\n", "Number of English words (types): 13003 \n", "\n" ] } ], "source": [ "def print_data_info(train_data, valid_data, test_data, src_field, trg_field):\n", " \"\"\" This prints some useful stuff about our data sets. 
\"\"\"\n", "\n", " print(\"Data set sizes (number of sentence pairs):\")\n", " print('train', len(train_data))\n", " print('valid', len(valid_data))\n", " print('test', len(test_data), \"\\n\")\n", "\n", " print(\"First training example:\")\n", " print(\"src:\", \" \".join(vars(train_data[0])['src']))\n", " print(\"trg:\", \" \".join(vars(train_data[0])['trg']), \"\\n\")\n", "\n", " print(\"Most common words (src):\")\n", " print(\"\\n\".join([\"%10s %10d\" % x for x in src_field.vocab.freqs.most_common(10)]), \"\\n\")\n", " print(\"Most common words (trg):\")\n", " print(\"\\n\".join([\"%10s %10d\" % x for x in trg_field.vocab.freqs.most_common(10)]), \"\\n\")\n", "\n", " print(\"First 10 words (src):\")\n", " print(\"\\n\".join(\n", " '%02d %s' % (i, t) for i, t in enumerate(src_field.vocab.itos[:10])), \"\\n\")\n", " print(\"First 10 words (trg):\")\n", " print(\"\\n\".join(\n", " '%02d %s' % (i, t) for i, t in enumerate(trg_field.vocab.itos[:10])), \"\\n\")\n", "\n", " print(\"Number of German words (types):\", len(src_field.vocab))\n", " print(\"Number of English words (types):\", len(trg_field.vocab), \"\\n\")\n", " \n", " \n", "print_data_info(train_data, valid_data, test_data, SRC, TRG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterators\n", "Batching matters a ton for speed. We will use torch text's BucketIterator here to get batches containing sentences of (almost) the same length.\n", "\n", "#### Note on sorting batches for RNNs in PyTorch\n", "\n", "For effiency reasons, PyTorch RNNs require that batches have been sorted by length, with the longest sentence in the batch first. For training, we simply sort each batch. \n", "For validation, we would run into trouble if we want to compare our translations with some external file that was not sorted. Therefore we simply set the validation batch size to 1, so that we can keep it in the original order." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "train_iter = data.BucketIterator(train_data, batch_size=64, train=True, \n", " sort_within_batch=True, \n", " sort_key=lambda x: (len(x.src), len(x.trg)), repeat=False,\n", " device=DEVICE)\n", "valid_iter = data.Iterator(valid_data, batch_size=1, train=False, sort=False, repeat=False, \n", " device=DEVICE)\n", "\n", "\n", "def rebatch(pad_idx, batch):\n", " \"\"\"Wrap torchtext batch into our own Batch class for pre-processing\"\"\"\n", " return Batch(batch.src, batch.trg, pad_idx)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the System\n", "\n", "Now we train the model. \n", "\n", "On a Titan X GPU, this runs at ~18,000 tokens per second with a batch size of 64." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def train(model, num_epochs=10, lr=0.0003, print_every=100):\n", " \"\"\"Train a model on IWSLT\"\"\"\n", " \n", " if USE_CUDA:\n", " model.cuda()\n", "\n", " # optionally add label smoothing; see the Annotated Transformer\n", " criterion = nn.NLLLoss(reduction=\"sum\", ignore_index=PAD_INDEX)\n", " optim = torch.optim.Adam(model.parameters(), lr=lr)\n", " \n", " dev_perplexities = []\n", "\n", " for epoch in range(num_epochs):\n", " \n", " print(\"Epoch\", epoch)\n", " model.train()\n", " train_perplexity = run_epoch((rebatch(PAD_INDEX, b) for b in train_iter), \n", " model,\n", " SimpleLossCompute(model.generator, criterion, optim),\n", " print_every=print_every)\n", " \n", " model.eval()\n", " with torch.no_grad():\n", " print_examples((rebatch(PAD_INDEX, x) for x in valid_iter), \n", " model, n=3, src_vocab=SRC.vocab, trg_vocab=TRG.vocab) \n", "\n", " dev_perplexity = run_epoch((rebatch(PAD_INDEX, b) for b in valid_iter), \n", " model, \n", " SimpleLossCompute(model.generator, criterion, None))\n", " print(\"Validation perplexity: %f\" % dev_perplexity)\n", " 
dev_perplexities.append(dev_perplexity)\n", " \n", " return dev_perplexities\n", " " ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1\n", " \"num_layers={}\".format(dropout, num_layers))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch Step: 100 Loss: 22.353386 Tokens per Sec: 16007.731248\n", "Epoch Step: 200 Loss: 34.410126 Tokens per Sec: 16368.906298\n", "Epoch Step: 300 Loss: 44.763870 Tokens per Sec: 16586.324787\n", "Epoch Step: 400 Loss: 57.584606 Tokens per Sec: 16717.486756\n", "Epoch Step: 500 Loss: 40.508701 Tokens per Sec: 16486.886104\n", "Epoch Step: 600 Loss: 51.919121 Tokens per Sec: 16529.862635\n", "Epoch Step: 700 Loss: 82.279633 Tokens per Sec: 16973.462052\n", "Epoch Step: 800 Loss: 35.026432 Tokens per Sec: 16724.939524\n", "Epoch Step: 900 Loss: 63.407204 Tokens per Sec: 16606.524355\n", "Epoch Step: 1000 Loss: 37.909828 Tokens per Sec: 19105.497130\n", "Epoch Step: 1100 Loss: 90.584244 Tokens per Sec: 19643.264684\n", "Epoch Step: 1200 Loss: 84.000832 Tokens per Sec: 19468.084935\n", "Epoch Step: 1300 Loss: 54.331242 Tokens per Sec: 19679.282614\n", "Epoch Step: 1400 Loss: 49.921040 Tokens per Sec: 19629.820942\n", "Epoch Step: 1500 Loss: 21.851797 Tokens per Sec: 19565.639729\n", "Epoch Step: 1600 Loss: 55.154270 Tokens per Sec: 19515.738007\n", "Epoch Step: 1700 Loss: 40.758137 Tokens per Sec: 19486.791554\n", "Epoch Step: 1800 Loss: 50.094219 Tokens per Sec: 19761.236905\n", "Epoch Step: 1900 Loss: 90.545143 Tokens per Sec: 19447.650965\n", "Epoch Step: 2000 Loss: 22.882494 Tokens per Sec: 
19539.331538\n", "Epoch Step: 2100 Loss: 99.448174 Tokens per Sec: 19278.704892\n", "Epoch Step: 2200 Loss: 16.793839 Tokens per Sec: 19183.702688\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was born years old , i was a of the of the .\n", "\n", "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my father was on his , the of the .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .\n", "Pred: he was very interested in the way , what was pretty much more , and then it was the .\n", "\n", "Validation perplexity: 31.839708\n", "Epoch 1\n", "Epoch Step: 100 Loss: 4.451122 Tokens per Sec: 19110.156367\n", "Epoch Step: 200 Loss: 11.262838 Tokens per Sec: 19538.253630\n", "Epoch Step: 300 Loss: 55.240711 Tokens per Sec: 19584.509548\n", "Epoch Step: 400 Loss: 54.733456 Tokens per Sec: 19787.183104\n", "Epoch Step: 500 Loss: 38.923244 Tokens per Sec: 19385.772613\n", "Epoch Step: 600 Loss: 63.162933 Tokens per Sec: 19013.165752\n", "Epoch Step: 700 Loss: 47.323864 Tokens per Sec: 18863.104141\n", "Epoch Step: 800 Loss: 43.414978 Tokens per Sec: 19258.337491\n", "Epoch Step: 900 Loss: 87.750214 Tokens per Sec: 19179.949782\n", "Epoch Step: 1000 Loss: 39.787056 Tokens per Sec: 19110.748464\n", "Epoch Step: 1100 Loss: 78.177170 Tokens per Sec: 19272.044197\n", "Epoch Step: 1200 Loss: 37.122997 Tokens per Sec: 19194.535740\n", "Epoch Step: 1300 Loss: 26.103378 Tokens per Sec: 19337.967366\n", "Epoch Step: 1400 Loss: 78.804855 Tokens per Sec: 19018.413406\n", "Epoch Step: 1500 
Loss: 61.593956 Tokens per Sec: 19259.272095\n", "Epoch Step: 1600 Loss: 81.611786 Tokens per Sec: 19259.527179\n", "Epoch Step: 1700 Loss: 28.692696 Tokens per Sec: 19230.891840\n", "Epoch Step: 1800 Loss: 84.163223 Tokens per Sec: 19071.272023\n", "Epoch Step: 1900 Loss: 36.782116 Tokens per Sec: 19209.383788\n", "Epoch Step: 2000 Loss: 56.666332 Tokens per Sec: 19127.522297\n", "Epoch Step: 2100 Loss: 5.576357 Tokens per Sec: 18957.458966\n", "Epoch Step: 2200 Loss: 38.791512 Tokens per Sec: 19166.811446\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was 11 years old , i was a of the .\n", "\n", "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my father was on his , in the little , the of the .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .\n", "Pred: he saw very happy , what was pretty much , and it was the of the .\n", "\n", "Validation perplexity: 19.906190\n", "Epoch 2\n", "Epoch Step: 100 Loss: 58.981544 Tokens per Sec: 19121.747106\n", "Epoch Step: 200 Loss: 34.874680 Tokens per Sec: 19689.768904\n", "Epoch Step: 300 Loss: 27.895102 Tokens per Sec: 19751.401628\n", "Epoch Step: 400 Loss: 52.931011 Tokens per Sec: 16369.447354\n", "Epoch Step: 500 Loss: 77.191933 Tokens per Sec: 16337.808093\n", "Epoch Step: 600 Loss: 65.645668 Tokens per Sec: 16307.871308\n", "Epoch Step: 700 Loss: 7.141161 Tokens per Sec: 16420.432824\n", "Epoch Step: 800 Loss: 76.990250 Tokens per Sec: 17512.558218\n", "Epoch Step: 900 Loss: 43.835995 Tokens per Sec: 16399.672659\n", 
"Epoch Step: 1000 Loss: 68.026192 Tokens per Sec: 16598.504664\n", "Epoch Step: 1100 Loss: 23.746111 Tokens per Sec: 16368.137311\n", "Epoch Step: 1200 Loss: 42.117832 Tokens per Sec: 16324.872475\n", "Epoch Step: 1300 Loss: 47.894409 Tokens per Sec: 16532.223380\n", "Epoch Step: 1400 Loss: 43.772861 Tokens per Sec: 16472.315811\n", "Epoch Step: 1500 Loss: 60.978756 Tokens per Sec: 16368.088307\n", "Epoch Step: 1600 Loss: 59.143227 Tokens per Sec: 16553.220745\n", "Epoch Step: 1700 Loss: 34.091373 Tokens per Sec: 16557.579342\n", "Epoch Step: 1800 Loss: 11.551711 Tokens per Sec: 16639.281663\n", "Epoch Step: 1900 Loss: 40.060520 Tokens per Sec: 16666.679672\n", "Epoch Step: 2000 Loss: 21.947863 Tokens per Sec: 16403.240568\n", "Epoch Step: 2100 Loss: 12.891315 Tokens per Sec: 16656.630033\n", "Epoch Step: 2200 Loss: 12.300262 Tokens per Sec: 16592.045153\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was 11 years old , i was a of the of the .\n", "\n", "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my father was on his little , , the of the bbc .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .\n", "Pred: he looked very happy to what was pretty much more , because it was the of the .\n", "\n", "Validation perplexity: 15.555337\n", "Epoch 3\n", "Epoch Step: 100 Loss: 36.178066 Tokens per Sec: 16064.364293\n", "Epoch Step: 200 Loss: 20.046204 Tokens per Sec: 16557.065342\n", "Epoch Step: 300 Loss: 53.514584 Tokens per Sec: 16375.767859\n", "Epoch Step: 400 Loss: 
29.280447 Tokens per Sec: 16687.195842\n", "Epoch Step: 500 Loss: 64.491814 Tokens per Sec: 16491.438857\n", "Epoch Step: 600 Loss: 62.286755 Tokens per Sec: 16443.863308\n", "Epoch Step: 700 Loss: 60.861393 Tokens per Sec: 16303.304238\n", "Epoch Step: 800 Loss: 25.101744 Tokens per Sec: 16437.206262\n", "Epoch Step: 900 Loss: 41.884624 Tokens per Sec: 16712.862598\n", "Epoch Step: 1000 Loss: 65.880905 Tokens per Sec: 16406.042864\n", "Epoch Step: 1100 Loss: 34.799385 Tokens per Sec: 16257.804744\n", "Epoch Step: 1200 Loss: 57.244125 Tokens per Sec: 16403.685499\n", "Epoch Step: 1300 Loss: 6.766514 Tokens per Sec: 16262.412676\n", "Epoch Step: 1400 Loss: 31.528254 Tokens per Sec: 16723.894609\n", "Epoch Step: 1500 Loss: 4.534189 Tokens per Sec: 16512.533272\n", "Epoch Step: 1600 Loss: 50.852787 Tokens per Sec: 16820.837828\n", "Epoch Step: 1700 Loss: 30.657820 Tokens per Sec: 16574.791159\n", "Epoch Step: 1800 Loss: 75.787910 Tokens per Sec: 16441.350335\n", "Epoch Step: 1900 Loss: 23.563347 Tokens per Sec: 16836.284727\n", "Epoch Step: 2000 Loss: 10.594786 Tokens per Sec: 16522.362683\n", "Epoch Step: 2100 Loss: 40.561062 Tokens per Sec: 16508.617285\n", "Epoch Step: 2200 Loss: 15.348518 Tokens per Sec: 16624.360367\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was 11 11 years old , i was a of the joy .\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my father was on his little , , , the of the bbc .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was 
unusual then , because the news mostly depressed him .\n", "Pred: he saw very happy , what was pretty much , because it was the .\n", "\n", "Validation perplexity: 13.563748\n", "Epoch 4\n", "Epoch Step: 100 Loss: 9.601490 Tokens per Sec: 16309.901017\n", "Epoch Step: 200 Loss: 13.329712 Tokens per Sec: 16693.352689\n", "Epoch Step: 300 Loss: 61.213333 Tokens per Sec: 16774.275779\n", "Epoch Step: 400 Loss: 37.759483 Tokens per Sec: 16628.037095\n", "Epoch Step: 500 Loss: 35.616104 Tokens per Sec: 16677.874896\n", "Epoch Step: 600 Loss: 58.753849 Tokens per Sec: 16452.736708\n", "Epoch Step: 700 Loss: 11.741160 Tokens per Sec: 16615.759446\n", "Epoch Step: 800 Loss: 24.230316 Tokens per Sec: 16804.673563\n", "Epoch Step: 900 Loss: 27.786499 Tokens per Sec: 16373.396939\n", "Epoch Step: 1000 Loss: 65.063515 Tokens per Sec: 16520.381173\n", "Epoch Step: 1100 Loss: 34.756481 Tokens per Sec: 16492.656502\n", "Epoch Step: 1200 Loss: 43.993877 Tokens per Sec: 17075.912389\n", "Epoch Step: 1300 Loss: 36.514729 Tokens per Sec: 16812.641454\n", "Epoch Step: 1400 Loss: 58.995735 Tokens per Sec: 16535.979640\n", "Epoch Step: 1500 Loss: 29.516464 Tokens per Sec: 16500.141569\n", "Epoch Step: 1600 Loss: 10.143467 Tokens per Sec: 16613.933279\n", "Epoch Step: 1700 Loss: 53.287037 Tokens per Sec: 16756.922926\n", "Epoch Step: 1800 Loss: 24.687494 Tokens per Sec: 16477.783348\n", "Epoch Step: 1900 Loss: 21.578268 Tokens per Sec: 16808.344988\n", "Epoch Step: 2000 Loss: 60.965946 Tokens per Sec: 16651.623717\n", "Epoch Step: 2100 Loss: 18.895075 Tokens per Sec: 16636.292649\n", "Epoch Step: 2200 Loss: 53.253704 Tokens per Sec: 16642.799323\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was 11 years old , i was a of the joy .\n", "\n", "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , 
grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my dad listened on his little , radio the bbc of the bbc .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .\n", "Pred: he saw a happy very happy , which was pretty much , because he was the most famous .\n", "\n", "Validation perplexity: 12.664111\n", "Epoch 5\n", "Epoch Step: 100 Loss: 21.919912 Tokens per Sec: 16266.471497\n", "Epoch Step: 200 Loss: 31.320656 Tokens per Sec: 16527.955427\n", "Epoch Step: 300 Loss: 40.778984 Tokens per Sec: 16517.710752\n", "Epoch Step: 400 Loss: 63.466324 Tokens per Sec: 16770.294841\n", "Epoch Step: 500 Loss: 49.329956 Tokens per Sec: 16694.936223\n", "Epoch Step: 600 Loss: 52.290169 Tokens per Sec: 16755.442966\n", "Epoch Step: 700 Loss: 51.911785 Tokens per Sec: 16768.565847\n", "Epoch Step: 800 Loss: 25.005857 Tokens per Sec: 16813.186507\n", "Epoch Step: 900 Loss: 50.679825 Tokens per Sec: 17109.031968\n", "Epoch Step: 1000 Loss: 13.069316 Tokens per Sec: 16692.984251\n", "Epoch Step: 1100 Loss: 12.595688 Tokens per Sec: 16546.293379\n", "Epoch Step: 1200 Loss: 46.846031 Tokens per Sec: 16491.379305\n", "Epoch Step: 1300 Loss: 30.238283 Tokens per Sec: 16558.196936\n", "Epoch Step: 1400 Loss: 23.865877 Tokens per Sec: 16556.353749\n", "Epoch Step: 1500 Loss: 42.451859 Tokens per Sec: 16784.645679\n", "Epoch Step: 1600 Loss: 37.048477 Tokens per Sec: 16651.129133\n", "Epoch Step: 1700 Loss: 17.043219 Tokens per Sec: 16655.630464\n", "Epoch Step: 1800 Loss: 17.227308 Tokens per Sec: 16688.568658\n", "Epoch Step: 1900 Loss: 23.672441 Tokens per Sec: 16609.439477\n", "Epoch Step: 2000 Loss: 19.385946 Tokens per Sec: 16586.442474\n", "Epoch Step: 2100 Loss: 25.717686 Tokens per Sec: 16879.694187\n", "Epoch Step: 2200 Loss: 
22.427767 Tokens per Sec: 16844.504307\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was 11 years old , i was by the morning of joy .\n", "\n", "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my father listened on his little , gray radio waves the bbc of the bbc .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .\n", "Pred: he saw a very happy ending , which was pretty unusual , since then they were .\n", "\n", "Validation perplexity: 12.246438\n", "Epoch 6\n", "Epoch Step: 100 Loss: 19.048712 Tokens per Sec: 19024.102757\n", "Epoch Step: 200 Loss: 31.636736 Tokens per Sec: 19387.779254\n", "Epoch Step: 300 Loss: 15.952754 Tokens per Sec: 19559.196457\n", "Epoch Step: 400 Loss: 24.849632 Tokens per Sec: 18968.450791\n", "Epoch Step: 500 Loss: 47.227837 Tokens per Sec: 19009.957585\n", "Epoch Step: 600 Loss: 8.887992 Tokens per Sec: 19024.581918\n", "Epoch Step: 700 Loss: 58.158920 Tokens per Sec: 16834.343585\n", "Epoch Step: 800 Loss: 32.257362 Tokens per Sec: 16725.454783\n", "Epoch Step: 900 Loss: 5.977044 Tokens per Sec: 16398.470679\n", "Epoch Step: 1000 Loss: 51.871101 Tokens per Sec: 16302.492231\n", "Epoch Step: 1100 Loss: 44.715164 Tokens per Sec: 16505.477988\n", "Epoch Step: 1200 Loss: 4.128096 Tokens per Sec: 19255.909773\n", "Epoch Step: 1300 Loss: 53.065189 Tokens per Sec: 19016.853318\n", "Epoch Step: 1400 Loss: 23.775473 Tokens per Sec: 18877.681861\n", "Epoch Step: 1500 Loss: 15.587101 Tokens per Sec: 18916.694718\n", "Epoch Step: 1600 Loss: 
59.449795 Tokens per Sec: 19166.565245\n", "Epoch Step: 1700 Loss: 48.393402 Tokens per Sec: 18836.264938\n", "Epoch Step: 1800 Loss: 45.651253 Tokens per Sec: 18823.983316\n", "Epoch Step: 1900 Loss: 51.898994 Tokens per Sec: 19015.027947\n", "Epoch Step: 2000 Loss: 16.392334 Tokens per Sec: 19180.065119\n", "Epoch Step: 2100 Loss: 20.312500 Tokens per Sec: 19059.061076\n", "Epoch Step: 2200 Loss: 41.126842 Tokens per Sec: 19110.648056\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was 11 , i was a of the joy .\n", "\n", "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my father listened to his little , radio shack the of the bbc .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .\n", "Pred: he looked very happy , which was pretty unusual , and then they had the news .\n", "\n", "Validation perplexity: 12.045694\n", "Epoch 7\n", "Epoch Step: 100 Loss: 22.484320 Tokens per Sec: 19136.387726\n", "Epoch Step: 200 Loss: 54.793003 Tokens per Sec: 19562.003455\n", "Epoch Step: 300 Loss: 52.516510 Tokens per Sec: 19494.585192\n", "Epoch Step: 400 Loss: 25.631699 Tokens per Sec: 19127.415568\n", "Epoch Step: 500 Loss: 15.818419 Tokens per Sec: 18909.082434\n", "Epoch Step: 600 Loss: 40.660767 Tokens per Sec: 19063.824782\n", "Epoch Step: 700 Loss: 21.253407 Tokens per Sec: 19011.780769\n", "Epoch Step: 800 Loss: 9.494976 Tokens per Sec: 19032.447976\n", "Epoch Step: 900 Loss: 21.503059 Tokens per Sec: 19120.646494\n", "Epoch Step: 1000 Loss: 34.198826 Tokens per Sec: 
18751.274337\n", "Epoch Step: 1100 Loss: 21.471136 Tokens per Sec: 19119.629059\n", "Epoch Step: 1200 Loss: 45.433662 Tokens per Sec: 19158.978952\n", "Epoch Step: 1300 Loss: 48.697639 Tokens per Sec: 18852.568454\n", "Epoch Step: 1400 Loss: 48.406239 Tokens per Sec: 19090.121092\n", "Epoch Step: 1500 Loss: 10.506186 Tokens per Sec: 18996.606224\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch Step: 1600 Loss: 22.061657 Tokens per Sec: 18889.519602\n", "Epoch Step: 1700 Loss: 11.148299 Tokens per Sec: 19179.133196\n", "Epoch Step: 1800 Loss: 16.580446 Tokens per Sec: 19184.709044\n", "Epoch Step: 1900 Loss: 20.219671 Tokens per Sec: 18889.205997\n", "Epoch Step: 2000 Loss: 21.245464 Tokens per Sec: 18869.151894\n", "Epoch Step: 2100 Loss: 29.567142 Tokens per Sec: 18825.496347\n", "Epoch Step: 2200 Loss: 22.790722 Tokens per Sec: 18923.950021\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was 11 years old , i was a of the joy .\n", "\n", "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my father listened to his little , radio the of the bbc .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .\n", "Pred: he looked very happy , which was pretty unusual , because he was going to put him in the .\n", "\n", "Validation perplexity: 11.837098\n", "Epoch 8\n", "Epoch Step: 100 Loss: 49.162842 Tokens per Sec: 19241.082862\n", "Epoch Step: 200 Loss: 35.163906 Tokens per Sec: 19633.028114\n", "Epoch Step: 300 Loss: 10.108455 Tokens per Sec: 17179.927672\n", 
"Epoch Step: 400 Loss: 12.883712 Tokens per Sec: 16510.876579\n", "Epoch Step: 500 Loss: 32.006828 Tokens per Sec: 16459.413702\n", "Epoch Step: 600 Loss: 21.056961 Tokens per Sec: 16640.683528\n", "Epoch Step: 700 Loss: 5.884560 Tokens per Sec: 16567.539919\n", "Epoch Step: 800 Loss: 17.562445 Tokens per Sec: 16529.548052\n", "Epoch Step: 900 Loss: 25.654568 Tokens per Sec: 16629.045928\n", "Epoch Step: 1000 Loss: 30.116678 Tokens per Sec: 16519.515326\n", "Epoch Step: 1100 Loss: 49.594883 Tokens per Sec: 16766.220937\n", "Epoch Step: 1200 Loss: 35.545147 Tokens per Sec: 16729.972737\n", "Epoch Step: 1300 Loss: 12.314122 Tokens per Sec: 16479.824355\n", "Epoch Step: 1400 Loss: 5.982590 Tokens per Sec: 16592.352361\n", "Epoch Step: 1500 Loss: 23.507740 Tokens per Sec: 16396.264595\n", "Epoch Step: 1600 Loss: 36.874157 Tokens per Sec: 16554.722618\n", "Epoch Step: 1700 Loss: 13.514697 Tokens per Sec: 16605.822594\n", "Epoch Step: 1800 Loss: 6.016938 Tokens per Sec: 16390.681327\n", "Epoch Step: 1900 Loss: 44.648132 Tokens per Sec: 16575.965569\n", "Epoch Step: 2000 Loss: 21.025373 Tokens per Sec: 16363.246501\n", "Epoch Step: 2100 Loss: 32.213993 Tokens per Sec: 16395.313089\n", "Epoch Step: 2200 Loss: 29.033810 Tokens per Sec: 16528.855537\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was 11 years old , i was a of the joy .\n", "\n", "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my father listened to his little , gray radio shack , the radio of the bbc .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was unusual then 
, because the news mostly depressed him .\n", "Pred: he looked very happy , which was pretty unusual , because he was the news of the most famous .\n", "\n", "Validation perplexity: 11.868392\n", "Epoch 9\n", "Epoch Step: 100 Loss: 33.819195 Tokens per Sec: 16155.433696\n", "Epoch Step: 200 Loss: 26.771244 Tokens per Sec: 16447.243194\n", "Epoch Step: 300 Loss: 22.235714 Tokens per Sec: 16557.847083\n", "Epoch Step: 400 Loss: 16.233931 Tokens per Sec: 16802.777289\n", "Epoch Step: 500 Loss: 34.811615 Tokens per Sec: 16637.208199\n", "Epoch Step: 600 Loss: 11.960271 Tokens per Sec: 16478.541533\n", "Epoch Step: 700 Loss: 32.807648 Tokens per Sec: 16526.645827\n", "Epoch Step: 800 Loss: 25.779436 Tokens per Sec: 16572.304586\n", "Epoch Step: 900 Loss: 18.101871 Tokens per Sec: 16472.573763\n", "Epoch Step: 1000 Loss: 34.465992 Tokens per Sec: 16489.131609\n", "Epoch Step: 1100 Loss: 47.311241 Tokens per Sec: 16501.563937\n", "Epoch Step: 1200 Loss: 22.709623 Tokens per Sec: 16416.828638\n", "Epoch Step: 1300 Loss: 45.883862 Tokens per Sec: 16338.132985\n", "Epoch Step: 1400 Loss: 21.321081 Tokens per Sec: 16680.505744\n", "Epoch Step: 1500 Loss: 11.126824 Tokens per Sec: 16636.646687\n", "Epoch Step: 1600 Loss: 32.759712 Tokens per Sec: 16440.968759\n", "Epoch Step: 1700 Loss: 19.354910 Tokens per Sec: 16476.318234\n", "Epoch Step: 1800 Loss: 14.631118 Tokens per Sec: 16490.663260\n", "Epoch Step: 1900 Loss: 2.233373 Tokens per Sec: 16390.177497\n", "Epoch Step: 2000 Loss: 42.503407 Tokens per Sec: 16498.365808\n", "Epoch Step: 2100 Loss: 35.935966 Tokens per Sec: 16257.764127\n", "Epoch Step: 2200 Loss: 37.685387 Tokens per Sec: 16498.916279\n", "\n", "Example #1\n", "Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .\n", "Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .\n", "Pred: when i was 11 , i was a of joy .\n", "\n", "Example #2\n", "Src : mein vater hörte sich auf seinem kleinen , 
grauen radio die der bbc an .\n", "Trg : my father was listening to bbc news on his small , gray radio .\n", "Pred: my father listened to his little , gray radio shack the bbc of the bbc .\n", "\n", "Example #3\n", "Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .\n", "Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .\n", "Pred: he looked very happy , which was pretty unusual since then , they were the .\n", "\n", "Validation perplexity: 11.886973\n" ] } ], "source": [ "model = make_model(len(SRC.vocab), len(TRG.vocab),\n", " emb_size=256, hidden_size=256,\n", " num_layers=1, dropout=0.2)\n", "dev_perplexities = train(model, print_every=100)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY8AAAElCAYAAAAcHW5vAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJzt3Xl8XHW9//HXJ1uTpkvapCmlC23pgoAtS1lboEUERQU38IKgoqKALCqoV+8VERVFEEEUQUQRVJYrcO8PZce20IWlLdCytbSldG+atkmbptk/vz/OmXY6TUlOOslJZt7Px+M8MnPmzJlPJu2853y/3/M95u6IiIhEkRN3ASIi0vMoPEREJDKFh4iIRKbwEBGRyBQeIiISmcJDREQiU3hIZGZ2jZl50rLJzGaZ2cdiqsfN7L87ad/XmFlT0v2ScN2Ezni9rtbK3zJ5+WuMdc0ws2fien1pW17cBUiP1QxMCW8PBq4E/mlmp7r70/GVlXZ/BB5Pul8C/AhYCiyMpaL0S/5bJtvY1YVIz6HwkA5z9xcSt81sOrASuBzYp/Aws17uXr+P5aWFu68GVsddx75oz/uZ/LcUaQ81W0lauPtWYAkwOrHOzPqa2a/NbJWZ1ZvZYjO7MPl5iWYhM5toZjPNrBb4ZfiYm9nVZnadmW0ws+1m9pCZDW6rHjM70sweN7Pq8HlPmtkhSY8fFtZ0dcrzHjazNWZWmlxfeHsk8G646b1JzTtTzex/zeylVuo4OdzmuPepdYaZPWNm55rZEjOrM7OXzOzoVrb9nJnNM7MdZlZpZneZ2YCkx0eGr3ehmd1qZhXAhrber7aY2Qoz+6OZfdPMVoav/6yZjU3ZrpeZ/SL8mzeY2TtmdqWZWcp2Q8L9rQv/DsvM7LpWXvdjZrbQzGrNbL6ZnbCvv4ukibtr0RJpAa4BmlLW5QHrgKfC+/nAHIIPrkuAUwhCoRm4KGVfzQTNQN8GpgHHhI85wbf+p4GPA18BKoE5Ka/twH8n3Z8E7Aif9yngDGAWsAnYP2m77wCNSa/3VaAF+HBrvyvQK9yfEzRdHRsu/YCPhusnpNR2H7CojfdzRvjeLQXOCV9jPlANDEra7hthfb
cBpwFfBNYAs4GccJuRYR1rgb+FdX2yrb9l+PdLXSxpuxXh32JBWN85BEH6LtArabsHgAbgB8CpwK/Cen6WtE1puL914b+Nk4EvAX9MeU/WAq8B/wGcDrwMVAElcf8f0OIKDy3Rl1Y+cPYHbg8/JC4Mt/lC+EF3VMpz7ww/NHKS9uXAl1t5ncSHYPKH0xnh+lNTtksOj38DrwJ5Sev6EQTP9UnrcoDpwDvA4UANcFNrv2vS/cSH83kp2+WEH6S3Jq0rBeqAy9t4P2eQEjxAOUEAXhfe70MQJremPHdy+NyPptQ3M8Lf0veyJIf8irCe8qR1E8Ltvhbe/2B4/zspr3FH+NyS8P5PCUL7oDbekzpgRNK6I8L9nx33/wEtrmYr6bBcgg+ARoJvv+cB17j7neHjpxE0Y71iZnmJBXgS2A8Yk7K//7eX13nUd2+vfxSoJ/jGvwczKwJOJPgGTNLr1gJzgZ3NHu7eQhByZeFjy4Hvt/2r7ync153AeWENhPt24N527GKJu+/sgHf3CuB5dv2exxEE4H0p7+eLwLbk3yv0aITym4GjWlkeStnuubCuRI0LCYI3UeOJ4c+/pzzvPqAQSDTDnQLMcve326jrDXdfmXw//DmijedJF1CHuXRUM8GHhgNbgJXu3pT0eDkwniBcWlOadLvF3Sv3sl1F8h13dzPbCAzZy/YDCYLtunBJtSRlf6vMbCZwJnCH71tH/V0E3+Q/SxAYXwUecvct7XhuRSvrNhA0wUHwfkLQRNWa0pT7kfo53H1eOzbbW42Jv8WApHXJ1qc8XkoQem3Z7X1z9/qw66SwHc+VTqbwkA5r4wNnM7CY4IikNcnfOt/vugDlyXfCjtdBBE1frakiaC67ifDoI0Vdyv7OIgiOBcCPzexhd9/bvt+Xu28ws/8FLjSz5cDBBG367VHeyrrB7Po9N4c/zyHoG0mVOqy2M661sLca3wlvb0latyZpm/3Cn4nfoRIYmvbqpEup2Uo6y5PAAcAmd5/XylLTzv18wsx6Jd8n6LhudWipu28n+Hb+wb287uuJbc1sKEF7/J8JOuprgD+njgxKkTgy2du33zsImpB+QdAUNbPtXxGAccknHppZebifxO85O6xv1F5+r/fa+Tr74sSwrkSNE4CxSTU+F/78j5TnfY4gtF8O7z8DTDGzcZ1Yq3QyHXlIZ7kX+DIw3cxuJGivLgYOAo5198+2cz8tBCcf3kLwjfZ6YK67P/U+z/k2MNPM/gn8haC5ZTBB5/I77v7bMCD+QvBt+XJ3rzGz8wk6ai8DfrOXfW8g+AZ9jpktJugIXuzu28LH/03wTXwK8N12/o4QNO08bGY/DPf5Q4JRS7+GYCi0mX0PuNnMhgBPEfTjjCAY1fRbd58T4fV2Y2at9SFtc/c3ku5vAp4ws58QBPh1BB3pd4c1LjKzB4HrzKwAmBfW9nWCjv+qcD+/Bs4HZpjZtQRHqMOAE9z9ax39HaRrKTykU7h7o5mdCvwX8E2CD7kqgg+K+yPs6g8E3/L/RDDi6AnaaApy93nhh+E1BKPAigk+nF8gGL4K8C1gKsEHVk34vOfN7HrgejN7xt3fbGXfLWb2JYIPzqfC2qYRhE6iT+aRcP9/ifB7vkHQZ3ItwXv1GsGIsp3NUe5+m5mtJhhi/OVw9SqCb/Lv0nG5BAMGUr3I7gMTngBeJwjWQQRHQxe7e0PSNl8If4dLCAL7vbDem5J+j81mdjzBe3gtwUCA1UT7dyExM3ddhla6JzNz4Ifu/tO4a4nCzBYBb7v7We3cfgbBcOBTOrWwfWBmK4Bn3P2rcdci3YOOPETSIOyXOYLgZMZDATW/SEZTeIikxxCCM+o3A99199aagUQyhpqtREQkMg3VFRGRyBQeIiISWcb2eZSVlfnIkSPjLkNEpEeZP39+pbsPamu7jA2PkSNHMm9ee6brERGRBDNr12wFarYSEZHIFB4iIhKZwkNERCJTeIiISGQKDxERiUzhISIikWXsUN
2OemNtNU+9sYGhA4o4e9LwuMsREemWdOSR4r6XVnLLs+/w9xdXxl2KiEi3pfBIMXVccJXN11ZXsammvo2tRUSyk8IjxfFjSinIzcEdnn+nMu5yRES6JYVHit4FeRwzeiAA0xdXxFyNiEj3pPBoxdTxQdPVc0s20tyi652IiKRSeLRi6vhgQskttY0sXF0VczUiIt2PwqMVo8uKGT6wCIDpizfGXI2ISPej8GiFmTEtbLqaqX4PEZE9KDz2ItF09drqaio1ZFdEZDcKj704bnQZBXnB2/PcEjVdiYgkU3jsRVFBLseOLgXU7yEikkrh8T6mhU1Xz7+jIbsiIskUHu8jcb5HVW0jr67SkF0RkQSFx/sYVVbMAaW9AZihUVciIjspPNqQGLI7Q/0eIiI7dWl4mNnPzOxdM9tqZhVm9g8zG5H0+BfMbJmZ1ZrZi2Z2ZFfW15qTwn6PRWuqqdhWF3M1IiLdQ1cfedwLHObu/YCRwErgfgAzmwL8HrgYGAA8BDxmZv26uMbdHDe6lF47h+xqll0REeji8HD3t929OrxrQAswPrx/IfCwuz/l7vXADUA98KmurDFVYX4uxx0YDNlVv4eISKDL+zzM7FwzqwZqgCuAa8KHJgLzE9u5uwOvhOvbu+9SMxtnZuOamprSVvPUcUHT1XNLNtLU3JK2/YqI9FRdHh7u/nd37w8MIQiOReFDfYHqlM2rgCjNVpcBi4HFFRXpO0pIDNndWtekIbsiIsQ42srd1wN3Av80s4HANqB/ymYlwNYIu72VoBlsfHl5eVrqBBhZVsyosmJAF4gSEYH4h+rmAcXA/sBrwBGJB8zMgMPC9e3i7pvcfYm7L8nLy0troYmJEjVkV0SkC8PDzHLM7FIzKw/vDwN+B6wA3iY4Cvm0mX3IzAqAK4FC4JGuqvH9JJqu3li7lYqtGrIrItmtq488TgdeN7PtwItALXCKuze5+yzgEoIQqQbOBk539yjNVp3mmFEDKcwP3q4ZmmVXRLJcl4WHu7e4++nuXu7uxe4+1N0/7+7Lkra5x91Hu3uRux/t7vPfb59dqTA/l+MPLANgppquRCTLxd3n0aMk+j2ee0dDdkUkuyk8Ipg6Luj32FbXxIKVGrIrItlL4RHBiNLejB6kIbsiIgqPiDTLroiIwiOyRL/HW+u2sr5aQ3ZFJDspPCI6etRAivJzAZi5RE1XIpKdFB4R9crLZfKYxCy7aroSkeyk8OiAk8J+j1nvVNKoIbsikoUUHh2QmKJ9W30T89/bEnM1IiJdT+HRAcMH9mZMeR9ATVcikp0UHh2UOPrQ1QVFJBspPDpo2kFBv8fb67exrnpHzNWIiHQthUcHTRo5gN4FwZBdNV2JSLZReHRQMGQ3mGVXTVcikm0UHvsgcbb57KWbaGjSkF0RyR4Kj32QuLpgTX0T897bHHM1IiJdR+GxD4aWFDFucDBkVxeIEpFsovDYR4mjD03RLiLZROGxjxL9Hks21LCmSkN2RSQ7KDz20aQDBlK8c8iujj5EJDsoPPZRQV4OU8Ymhuyq30NEsoPCIw0S/R5zllZS39QcczUiIp1P4ZEGiX6P7Q3NzFuhWXZFJPMpPNJgSP8iDtqvL6B+DxHJDgqPNDkpPPqYrn4PEckCCo80mRb2eyytqGH1ltqYqxER6VwKjzQ58oAB9O2VB2jUlYhkPoVHmuTn5miWXRHJGgqPNJp20K5ZdjVkV0QymcIjjU4aF/R77Ghs5qV3NcuuiGQuhUca7de/kA8M6Qeo30NEMpvCI82m7hyyq34PEclcCo80SwzZXb5xO6s2a8iuiGQmhUeaHTGihL6FiSG7OvoQkcyk8EizvNwcTghn2dXZ5iKSqRQenWDnLLvLKqlr1JBdEck8Co9OMHVc0Gle19jCixqyKyIZSOHRCcr7FXLI/okhu+r3EJHMo/DoJIkhuzPV7yEiGUjh0UkS/R7LK7fz3qbtMVcjIpJeXRYeZna9mb1hZlvNbK2Z3WlmA5
Me/5KZtZhZTdJyX1fVl26HDy+hX6Fm2RWRzNSVRx7NwHlAKTARGAbcnbLNcnfvk7Sc04X1pVVebg4njNPZ5iKSmbosPNz9B+7+irs3uvtG4BZgajpfw8xKzWycmY1rampK5647JHG2+dxlmzRkV0QySqTwMLOXzexCM+uThtf+EPBayrrhZrbezFaZ2f1mNiriPi8DFgOLKyri/7Z/UnjkUd/UwgvLN8VcjYhI+kQ98pgOXAusM7O7zOyYjryomX0GuAi4Imn1c8AHgf2Bo4A64GkzK46w61uB8cD48vLyjpSWVoP69uKDQ/sD6vcQkcwSKTzc/bvAcOCLwH7AbDNbZGaXm9mA9uzDzM4C7gTOcPcFSfte7u5L3L3F3dcDFxIEybER6tsU7mNJXl5ehN+s8ySG7Op8DxHJJJH7PNy9yd0fdvePAQcADwPXA2vM7G9mdtTenmtmFwB3AJ9w9+ltvVS4WNQau5NEeKzYVMu7lRqyKyKZocMd5mZ2IHAp8DVgB/BHoJDgaOTqVra/HLgROM3dZ7fy+MfMbJgFBgK/AyqBFzpaY3dw2PABlPTOB3T0ISKZI2qHeS8z+7yZTSfomD4B+B6wv7tf7u6fAT4JXNnK028B+gHTk8/lSHp8KvASUAO8QTCk98PuXrPHnnqQ3BzjhLGJIbvq9xCRzBC1Y2A9QVPSX4FL3f2NVraZA+wxG6C7v2/zk7t/B/hOxHp6hGnjB/Hoa2t5YfkmdjQ0U1SQG3dJIiL7JGqz1beAoeFRRmvBgbtXuXvUIbYZ7cRwyG6DhuyKSIaIGh4n0srRipkVm9mf0lNS5inr04sJw4IhuzrbXEQyQdTw+CJQ1Mr6IuAL+15O5kpMlDhj8UbcPeZqRET2TdTwMII+j10rzAyYAqg3+H0khuyu3FzLcg3ZFZEerl3hEc5220wQHOvNrDmxAE3AQwSd6LIXE4eVMGDnkF3lrIj0bO0dbXUOwVHH3wmmFalOeqwBeNfdX01zbRklN8c4cdwg/u/VtcxYXMFXpmhMgYj0XO0KD3d/AMDM1gGz3T3+KWt7oKnjg/B4cflmahua6F3QPaZQERGJqs1mKzNLnmHwLWCgmZW3tnRemZnhxLGDMIOG5hbmLtOQXRHpudrT57EuKRjWA+taWRLr5X2U9unFhGElgIbsikjP1p52k5PZdcb4yaSMtpJopo0fxGurqnYO2Q0Gq4mI9Cxthoe7z0y6PaNTq8kCU8eXc/Mz77B6yw6WbdzOmPJ0XFdLRKRrRZ0Y8aq9rC80szvSU1JmmzC0P6XFBYBm2RWRnivqSYLfN7PHzWxQYoWZTQAWkObrkWeqnHDILuh8DxHpuaKGx+FAH2ChmZ0WXqPjRYKp1I9Id3GZKnG2+UvvbmZ7vUY9i0jPE/UytCuBk4D/AR4juLjTV9z9S+6uOTfaKXnI7hwN2RWRHqgjVxKcBpxF0FS1Hfi8mZWltaoMN6C4gMOGa8iuiPRcUTvMfwE8DtwJHAscBpQAi8zs1PSXl7mmhbPsztQsuyLSA0U98vg8waVhr3b3Znd/j+BStHcCj6a9ugyW6PdYU7WDpRU9+kq7IpKFoobHxOTzPgDcvcXdrwZOSV9Zme/Q/ftT1icYsqumKxHpaaJ2mG8GMLNSMzvGzHolPfZ8uovLZBqyKyI9WdQ+jz5mdh/BhZ/mAEPD9XeY2Y86ob6Mlri64MsrNlOjIbsi0oNEbbb6OTAaOAbYkbT+n8Cn0lVUtjhxbBk5Bo3NzuyllXGXIyLSblHD4wzgCnd/md0nSHyLIFQkgpLeBRw+YgCgpisR6VmihscgYEMr64sIrjQoEU0bn+j3qNCQXRHpMaKGx0JaH1X1eeDlfS8n+yT6PdZV17Fkg4bsikjPEPU6qNcA/zCzYUAucI6ZfYDgjPMPp7m2rHDwkH6U9elFZU090xdXMH6/vnGXJCLSpqhDdZ8APkkwv1UL8F/AAcBH3f259JeX+XJybOcJg5qiXUR6ishzW7
n7M+4+1d37uHtvdz/B3f/dGcVli0R4zFuxhW11jTFXIyLSto5MjChpdsKYQeTmGE0tGrIrIj1Dm+FhZjvMrLY9S1cUnIn6987niBHBLLsasisiPUF7OswvZvdzOqQTTB1fzssrtjAjnGXXTCOfRaT7ajM83P3uLqgj600dP4gbnlzM+q11vL1+Gx8Y0i/ukkRE9qpDfR5mdryZfTVcjk93Udno4CH9KO8bzDOpWXZFpLuLOjHicDObC8wCfhkus8zsBTMb3hkFZguz5CG76vcQke4t6pHHnQRNXYe4+0B3HwgcQjA1yZ3pLi7bJM42n//eFrZqyK6IdGNRw+Mk4GJ3fyuxIrx9KXBiOgvLRlPGlpGbYzS3OLPe0ZBdEem+oobHOqC1C080A2qo30f9CvM58oDELLt6O0Wk+4oaHlcDN5vZiMSK8PaNwA/TWVi2Su730Cy7ItJdRQ2P/wImAcvNbLWZrQaWA0cD3zezNxNLugvNFtPCfo+KbfW8uW5rzNWIiLQu6qy693f0hczseuDjwHCgBvgX8L3EddHDbb4A/AgYAiwCLnH3+R19zZ7ooP36sl+/QtZvrWPG4o0csn//uEsSEdlDu8PDzHKB6cBCd6/qwGs1A+cBrwMlwD3A3QRXJ8TMpgC/J7ic7UzgCuAxMxvr7lnzFTwxZPf+l1cxY3EF35g2Ju6SRET20O5mK3dvBp4GBnTkhdz9B+7+irs3uvtG4BZgatImFwIPu/tT7l4P3ADUk4XXRk/0eyxYWUV1rYbsikj3E7XP4y1gWJpe+0PAa0n3JwI7m6g86C1+JVzfLmZWambjzGxcU1Nrg8J6hsljysgLh+w+v1QnDIpI9xM1PK4CbjCzo8NmrA4xs88AFxE0TSX0BapTNq0CokzydBmwGFhcUdFzh7r2Lcxn0sjgAO9305exo6E55opERHYXNTweJRhtNReo68iU7GZ2FsHZ6Ge4+4Kkh7YBqb3DJUCU/o5bgfHA+PLy8ghP634uO3ksZvDWuq381yOLNGxXRLqVqKOtLtqXFzOzC4BfAZ9w99kpD78GHJG0rQGHAQ+3d//uvgnYBDBp0qR9KTV2k8eUcdWp47nhycU8/MoaJg4v4YvHj4y7LBERIGJ4uPtfOvpCZnY5wTDc09z95VY2uRN4wsz+AjwPXA4UAo909DV7uotPOpDXVlXx1Jsb+Mk/3+SQ/fsxaeTAuMsSEYk+JbuZlZvZlWb2ezMrC9dNNrNRbTz1FoL+i+lmVpNYEg+6+yzgEoIQqQbOBk7PpmG6qXJyjF+dPZHRZcU0tTiX/G0BFVvr4i5LRCTylOyHA28DFwBfYVdn9oeBn77fc93d3D3f3fskLynb3OPuo929yN2PzrYTBFvTtzCfO84/kt4FuVRsq+cbf19AY3NL3GWJSJaLeuTxK+AP7n4owTkYCU8Ck9NWlexm7OC+3PDZYMTyyyu28LN/vdXGM0REOlfU8DgC+GMr69cCg/e9HNmbj00YwtdOHA3A3XNW8Mgrq2OuSESyWdTwaAKKW1l/ILC5lfWSRt89bTzHjS4F4PsPL+LNtVnbHSQiMYsaHk8A3wmH0QK4mQ0AriU4B0Q6UV5uDreeezhD+hdS19jCRX+dr+lLRCQWHTnD/EhgGcEw2oeAdwlO5vtBekuT1pT16cXvzzuSgtwcVm6u5ZsPvEJLi04gFJGuFTU8tgBHERxp3AG8AFwJTEqeWl0612HDS/jxmYcAMH3xRm559p2YKxKRbNOukwTNbCDwF+AjBIHzAvB5d1/ReaXJ+znn6BG8urKKB+at4pZn32HCsP586AMasyAiXaO9Rx4/A44hOEP8OwQjq27vrKKkfX585iFMGBZMB/bNB15lReX2mCsSkWzR3vD4KPAVd7/O3W8iuIDTKWYWdW4sSaPC/Fx+f96RDCwuYFtdExf9dT61DT13KnoR6TnaGx5D2f1aG28CDcD+nVGUtN/QkiJuPedwcgzeXr+N7z+sGXhFpPO1Nz
xygdQxoc3heonZ5DFlfPcjBwHwf6+u5c+zV8RbkIhkvCjNTveZWUPS/ULgz8nX8XD309NWmUTy9RNH89qqKh5/fT3XPfYWhw7tz9GjNAOviHSO9h55/AVYBWxIWv5KcI5H8jqJiZlxw1kTOXDQrhl4N2gGXhHpJJap7eOTJk3yefPmxV1Gl1taUcMnfzebmvomjhhRwv1fO46CvMgz74tIljKz+e7e5tX09KmSYcaU9+HGsyYAsGBlFT/915sxVyQimUjhkYE+cugQLjrpQADumfseD83XDLwikl4Kjwx11anjmDKmDIAfPLKI19dUx1yRiGQShUeGysvN4TfnHM7QkiLqm1q4+G/zqaptaPuJIiLtoPDIYAOLC/j9eUdQkJfDqs07uPz+V2nWDLwikgYKjww3YVgJPz3zUACeW7KRm59ZEnNFIpIJFB5Z4OyjhnPO0SMAuPXfS3n6TZ2SIyL7RuGRJa4542AmDi8B4NsPvMq7moFXRPaBwiNL9MrL5fbzjqC0uIBt9U18/d55bK/XDLwi0jEKjywypH8Rt54bzMC7ZEMN33tooWbgFZEOUXhkmeMPLOP7H/0AAP9cuI67Zr0bc0Ui0hMpPLLQV08YxccmDAHg54+/zdxlm2KuSER6GoVHFjIzfvmZCYwt70Nzi3PZfQtYV70j7rJEpAdReGSp4l553H7+kfTtlUdlTQMX/3UB9U3NcZclIj2EwiOLHTioD786eyIAr66q4tpHNQOviLSPwiPLnXrIflw6bQwAf3txJQ/OWxVzRSLSEyg8hG99eBwnjhsEwH//7+ssWq0ZeEXk/Sk8hNwc45bPHcawAUU0NLVw0V/ns2W7ZuAVkb1TeAgAA4oLuP28I+mVl8Oaqh1cfv8rmoFXRPZK4SE7HTq0Pz/71AcBeP6dSn711OKYKxKR7krhIbv57JHDOO/YYAbe22Ys48k31sdckYh0RwoP2cPVHz+Ew0cEM/Be+eBrLNtYE3NFItLdKDxkDwV5Ofz+80dS1qeAmvomvn7vfCq21cVdloh0IwoPadV+/Qv57blHkJtjLK2o4eQbZ/KH55bR0NQSd2ki0g0oPGSvjh1dyk1nT6RvYR419U1c99jbfOTm55j+dkXcpYlIzLo0PMzsP8zseTPbamZNKY9NNTM3s5qkZU5X1id7OvOwocy4airnHD0cM1heuZ0L7n6ZC/78EsvVFyKStbr6yGMLcBvwzb083uzufZKW47uwNtmL0j69+PmnJ/DopVM4auQAAKYv3shpNz/HdY+9xba6xpgrFJGu1qXh4e5Puvt9wPKufF1Jj0OH9ufBrx/Hb845nCH9C2lsdv7w3HKm3TiTB+etokUnFYpkje7W55FrZqvMbL2Z/cvMJkZ5spmVmtk4MxvX1KTrc3cGM+OMifvz7JUncfnJYyjIy6Gypp7v/mMhn7ptNgtWbom7RBHpAt0pPN4GDgNGAQcBC4F/m9n+EfZxGbAYWFxRoU7dztS7II9vnzqeZ799Eh89dD8AXltdzadvm8O3H3iVDVs1tFckk5l71zc1mNlU4Bl3z2tju3eAX7j7Xe3cbylQCjBx4sTFr7766r6WKu00Z2klP370TRZv2AZA74JcLj15DF+ZMopeebkxVyci7WVm8919Ulvbdacjj9a0ANbejd19k7svcfcleXnvm0uSZsePKeNfl0/h2jMPoX9RPrUNzfzyicWc+uvnePrNDcTxJUVEOk9XD9XNNbNCoCC8XxguZmYnm9kYM8sxsz5mdg0wGHiyK2uUjsvLzeELx41kxlVTOf/YA8gxeG9TLRfeM48v/OklllZsi7tEEUmTrj7yOB/YQRAIueHtHcABwETgWWAbwWisY4EPu7subdfDDCgu4CefPJR/XX4Cx44eCASz9J528/P8+NE3qN6hob0iPV0sfR5dYdKkST5v3ry4y8h67s7jr6/nZ/96izVVOwAYWFzAVaeO53NHDSc3p92tkiLSBTKlz0P5PCIUAAAMbE
lEQVR6ODPj9A8O4dkrT+Jbp4yjMD+Hzdsb+MEjizjjt7N4ecXmuEsUkQ5QeEiXKMzP5YpTxvLslVP5+IQhALyxditn3T6Xy+57hbXhUYmI9AwKD+lSQ0uK+O25R/Dg14/j4CH9AHj0tbV86Fcz+c2z71DX2BxzhSLSHgoPicXRowby6GVTuO5TH2RgcQE7Gpu56eklnHLTTB5ftE5De0W6OYWHxCY3xzj3mBFMv3IqF0weSW6OsXrLDi7+2wLOvfNF3l6/Ne4SRWQvFB4Su/698/nRJw7hiStO4ISxZQDMXb6J0295nqv/73WqahtirlBEUik8pNsYO7gv93z5aP5w/pGMGNibFod75r7H1BtncO/cFTQ16yqGIt2FzvOQbqmusZm7Zr3L76YvpbYh6EQfVVbMtPHlTB5TyjGjS+nTS1PQiKRbe8/zUHhIt7a+uo7rn3ibR15Zs9v6vBxj4vASJh9YyuQxZRw+YgAFeTqQFtlXCg+FR0ZZuLqKxxatZ/bSSl5fW03qP9ui/FyOGjWQKWNKOf7AMg4e0o8cnb0uEpnCQ+GRsapqG3hh+SZmLa1kztJNLK/cvsc2A3rnc/yBZRw/ppTJB5ZxQGlvzBQmIm1ReCg8ssbaqh3MXlrJnGWbmL20kopt9XtsM7SkiMljgiau4w4spbxvYQyVinR/Cg+FR1Zyd5ZW1DB7aSWzl23ihWWb2Fa/5yWJxw/uu/Oo5JjRA+lbmB9DtSLdj8JD4SFAU3MLi9ZU7zwqmffeFhqadh/ym5tjTBzWn8ljysLO9xJd/VCylsJD4SGtqGtsZt6KLcxeVsnspZUsWrNn53thfg5HjRzI5DFlTBmjznfJLgoPhYe0Q3VtI3OXb2LOskpmLa1k+cY9O99Leudz3OjSnUcmI9X5LhlM4aHwkA5YV72DOUs3hX0mlWzYumfne7/CPIYO6M3QkiKGDQiWoSVFDA1/DiwuULhIj6XwUHjIPnJ3lm3cHgTJ0krmLt/Etro9O99TFeXn7gySoUnhEvzsTXnfXmoGk25L4aHwkDRrbnHeWFvNso01rN68gzVVwbJ6S/AztSN+bwpycxhSUhiES0kRwwb03hk2wwYUsV//QvJzdba8xKO94aHJgUTaKTfHmDCshAnDSvZ4rKXFqdxez5otu8JkzZZEuNSyZssOtodzdDU0t/Deplre21Tb6uvkGOzXrzDl6KX3bk1jhfkaDSbxUniIpEFOjlHet5DyvoUcPmLAHo+7O9U7Glm9R7jU7rxfVdsIQIvD2uo61lbX8TJbWn29sj4FDCwuoG9hPn0L85J+5tEv6XbfXrs/3q8onz698shVs5nsI4WHSBcwM0p6F1DSu4BDh/ZvdZua+ibWJh2prA4DJhEuG5POnK+saaCypuPXOenTK29XwEQNoMJ8+hQqgLKdwkOkm+jTK49xg/sybnDfVh+va2xmXXXdziOWqtpGttU1sa0u+Lk16fa2+sRjTTS37NmvWVPfRE19E+uqO15vcUEufQvz6Ve0K2z6Fe0KmOTbicf6FebTL7zdKy9Ho9J6MIWHSA9RmJ/LqLJiRpUVt/s57s6OxuadIbM1DJSdIVO3K2S2trIucbuplQDa3tDM9oZmOnq14Pxcaz1kwkDqGwZN33CbXbfDo6BeeZ02aq2lxWlxp8UJfwa3m1scb+22Oy0tjofbO+x8DIKf7uA4LS3BT09e57tvH+xn1zrHIbEuafvE66RuD84JYwdR3InXvFF4iGQwM6N3QR69C/IY3K9jk0G6O3WNLUnhkxo44ZHPjsad67bu2BVGiZ+pGpudTdsb2LS9Y81vZsHRWr/CfIp75eLhh3jiA7y5ZffbiQ/X5A/65qRgSA6MTDDjqqkKDxGJj5lRVJBLUUEu5f06to/mFqemPjz62bHrKCgInF2h1FrobN3RyNa6Rhqbd/9Ud2dniGWrHAv+PjkGhoGx83ZnZ6DCQ0Q6XW6O0b8on/5F+bDnYLQ2uTv1TS1hkOwZLL
X1zZhBjhm5ObbzQ3W322bk5ATb7FqCkXKt3c412/nBnJuz++0cMyzpdk742skf5GbsrMnCdTlG+AFvGLt/8FsOe64L95F47s7HukFfkcJDRLo9M6MwP5fC/I4f/Uh66TRWERGJTOEhIiKRKTxERCQyhYeIiESm8BARkcgUHiIiEpnCQ0REIsvYi0GZ2UbgvQ4+PRcYDGwAmtNWVM+k92J3ej92p/djl0x5Lw5w90FtbZSx4bEvzGwcsBgY7+5L4q4nTnovdqf3Y3d6P3bJtvdCzVYiIhKZwkNERCJTeLRuE/Dj8Ge203uxO70fu9P7sUtWvRfq8xARkch05CEiIpEpPEREJDKFh4iIRKbwEBGRyBQeIiISmcJDREQiU3iIiEhkCg8REYlM4ZHEzHLN7AYz22hm28zsITMri7uuOJjZ9Wb2hpltNbO1ZnanmQ2Mu664mVmOmc0xMzezYXHXEyczO8XMXjCzGjOrNLPb4q4pLma2n5k9EH52bDGzf5vZxLjr6kwKj939J3AmcAyQ+GC4N75yYtUMnAeUAhMJ3o+74yyom/gWUBt3EXEzs6nAP4AbCf6NDAP+GGdNMbsNGAiMI5iWfR7wTzOzWKvqRJqeJImZvQdc6+53hfcPBJYCI929o9cGyQhm9hHgQXfvF3ctcQmn3H4c+AzwCjDc3VfHW1U8zGwuMNPd/zPuWroDM1sI/Nbd/xDeHw+8DQxy98pYi+skOvIImVkJMAKYn1jn7suArQTfvLPdh4DX4i4iLmaWA/wJuAqoirmcWJlZMXA0kGdmC8ImqxlmNinu2mJ0A/AZMxtkZoXA14BZmRocoPBI1jf8WZ2yvgrI2m/bAGb2GeAi4Iq4a4nRFcB6d38k7kK6gQEEnx3nAF8C9geeAh4Lv4Rlo9kEVxKsAGqATwMXxlpRJ1N47LIt/Nk/ZX0JwdFHVjKzs4A7gTPcfUHc9cTBzMYAVwKXxl1LN5H4v/Jnd1/o7g3Az4F84Pj4yopHeFT6DLCE4POjN/Az4HkzGxxnbZ1J4RFy9ypgJXBEYp2ZjSY46lgYV11xMrMLgDuAT7j79LjridEUYBDwuplVAokQXWhml8RXVjzcvRpYAaR2mHor67LBQGAUcKu7b3X3Bnf/I8Hn63HxltZ5FB67+wPwPTMbZWb9gOuBJ919RbxldT0zu5xgJM1p7j477npi9iBwIHBYuJwerj8VuCeuomJ2G3CBmR1sZnnAd4B6YE68ZXW9sF9jCXCJmRWbWZ6ZfZmgKTxjv3jmxV1AN/MLgvbcl4FewNMEw1Wz0S1AEzA9ebShu/eJraKYuHstScNzww9LCPpAauKpKnY3Enw4/hsoJBh99tHwqCQbfZKg0/w9gua7pcBZ7r481qo6kYbqiohIZGq2EhGRyBQeIiISmcJDREQiU3iIiEhkCg8REYlM4SEiIpEpPER6CDP7kpnVxV2HCCg8RNrFzO4OLwCVumTllOwiOsNcpP2mA+emrGuOoxCRuOnIQ6T9Gtx9fcqyEcDMVpjZtWb2p/DSvRvN7CfJV5Izs/5mdld4/Ys6M5ttZrtNnGdmY8PLH28xs1oze8XMpqVsc4KZvRo+/pKZHd41v77ILgoPkfT5JsHMzJMILhp1JXBx0uN/Bk4CPgccCSwDnkxM221mQwiuC9GbYPLFCcBPU14jP1z3jXAfVcD94bTgIl1Gc1uJtIOZ3U0wSWZqh/Uj7n6+ma0A3nX3aUnP+SXwaXcfY2ZjCWZe/bC7PxM+nphA7x53/6GZ/RS4ABjj7jtaqeFLBAE00d0XhusmA7PQpZKli6nPQ6T95gBfTlmXPKvu3JTHZgNXhZcl/QDBtS5mJR5098bwWuAHh6uOILh06R7BkaQJeD3p/trw52CCGV1FuoTCQ6T9at19acw1NLt7S9L9RNOBmq2kS+kfnEj6HJty/3iCpqw64E3ACK5KCOxstjoOeCNctQCYbGZFXVCryD5ReIi0X4GZ7Ze6JD0+ycx+aGbjzO
x8gmue/xogPGJ5GLjdzE42s4OBuwguPva78Pm3EVxY6WEzO87MRpvZmamjrUS6AzVbibTfNGBd6srwCALgZmAMMB9oILga421Jm34ZuAn4H6A43O40d98A4O5rzWwK8EvgSSAXeJtg1JZIt6LRViJpEI62ut3dfxF3LSJdQc1WIiISmcJDREQiU7OViIhEpiMPERGJTOEhIiKRKTxERCQyhYeIiESm8BARkcgUHiIiEtn/B1CUqx3mK+1vAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_perplexity(dev_perplexities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction and Evaluation\n", "\n", "Once trained we can use the model to produce a set of translations. \n", "\n", "If we translate the whole validation set, we can use [SacreBLEU](https://github.com/mjpost/sacreBLEU) to get a [BLEU score](https://en.wikipedia.org/wiki/BLEU), which is the most common way to evaluate translations.\n", "\n", "#### Important sidenote\n", "Typically you would use SacreBLEU from the **command line** using the output file and original (possibly tokenized) development reference file. This will give you a nice version string that shows how the BLEU score was calculated; for example, if it was lowercased, if it was tokenized (and how), and what smoothing was used. If you want to learn more about how BLEU scores are (and should be) reported, check out [this paper](https://arxiv.org/abs/1804.08771).\n", "\n", "However, right now our pre-processed data is only in memory, so we'll calculate the BLEU score right from this notebook for demonstration purposes.\n", "\n", "We'll first test the raw BLEU function:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "import sacrebleu" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100.00000000000004\n" ] } ], "source": [ "# this should result in a perfect BLEU of 100%\n", "hypotheses = [\"this is a test\"]\n", "references = [\"this is a test\"]\n", "bleu = sacrebleu.raw_corpus_bleu(hypotheses, [references], .01).score\n", "print(bleu)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "22.360679774997894\n" ] } ], "source": [ "# here the BLEU score will be lower, because some n-grams won't 
match\n", "hypotheses = [\"this is a test\"]\n", "references = [\"this is a fest\"]\n", "bleu = sacrebleu.raw_corpus_bleu(hypotheses, [references], .01).score\n", "print(bleu)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we did some filtering for speed, our validation set contains 690 sentences.\n", "The references are the tokenized versions, but they should not contain out-of-vocabulary UNKs that our network might have seen. So we'll take the references straight out of the `valid_data` object:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "690" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(valid_data)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "690\n", "when i was 11 , i remember waking up one morning to the sound of joy in my house .\n" ] } ], "source": [ "references = [\" \".join(example.trg) for example in valid_data]\n", "print(len(references))\n", "print(references[0])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"i 'm always the one taking the picture .\"" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "references[-2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now we translate the validation set!**\n", "\n", "This might take a little bit of time.\n", "\n", "Note that `greedy_decode` will cut-off the sentence when it encounters the end-of-sequence symbol, if we provide it the index of that symbol." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "hypotheses = []\n", "alphas = [] # save the last attention scores\n", "for batch in valid_iter:\n", " batch = rebatch(PAD_INDEX, batch)\n", " pred, attention = greedy_decode(\n", " model, batch.src, batch.src_mask, batch.src_lengths, max_len=25,\n", " sos_index=TRG.vocab.stoi[SOS_TOKEN],\n", " eos_index=TRG.vocab.stoi[EOS_TOKEN])\n", " hypotheses.append(pred)\n", " alphas.append(attention)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 70, 11, 24, 1460, 5, 11, 24, 9, 0, 10, 0,\n", " 0, 1806, 4])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we will still need to convert the indices to actual words!\n", "hypotheses[0]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['when',\n", " 'i',\n", " 'was',\n", " '11',\n", " ',',\n", " 'i',\n", " 'was',\n", " 'a',\n", " '',\n", " 'of',\n", " '',\n", " '',\n", " 'joy',\n", " '.']" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hypotheses = [lookup_words(x, TRG.vocab) for x in hypotheses]\n", "hypotheses[0]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "690\n", "when i was 11 , i was a of joy .\n" ] } ], "source": [ "# finally, the SacreBLEU raw scorer requires string input, so we convert the lists to strings\n", "hypotheses = [\" \".join(x) for x in hypotheses]\n", "print(len(hypotheses))\n", "print(hypotheses[0])" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "23.4681520210298\n" ] } ], "source": [ "# now we can compute the BLEU score!\n", "bleu = sacrebleu.raw_corpus_bleu(hypotheses, [references], .01).score\n", "print(bleu)" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "## Attention Visualization\n", "\n", "We can also visualize the attention scores of the decoder." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "def plot_heatmap(src, trg, scores):\n", "\n", " fig, ax = plt.subplots()\n", " heatmap = ax.pcolor(scores, cmap='viridis')\n", "\n", " ax.set_xticklabels(trg, minor=False, rotation='vertical')\n", " ax.set_yticklabels(src, minor=False)\n", "\n", " # put the major ticks at the middle of each cell\n", " # and the x-ticks on top\n", " ax.xaxis.tick_top()\n", " ax.set_xticks(np.arange(scores.shape[1]) + 0.5, minor=False)\n", " ax.set_yticks(np.arange(scores.shape[0]) + 0.5, minor=False)\n", " ax.invert_yaxis()\n", "\n", " plt.colorbar(heatmap)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "src ['\"', 'jetzt', 'kannst', 'du', 'auf', 'eine', 'richtige', 'schule', 'gehen', ',', '\"', 'sagte', 'er', '.', '']\n", "ref ['\"', 'you', 'can', 'go', 'to', 'a', 'real', 'school', 'now', ',', '\"', 'he', 'said', '.', '']\n", "pred ['\"', 'now', 'you', 'can', 'go', 'to', 'a', 'right', 'school', ',', '\"', 'he', 'said', '.', '']\n" ] }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAZUAAAEhCAYAAAC3AD1YAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJzt3XmYXFWd//H3J52EAFmAhM1AIAgBRcUdHVwQUXDHUQcVUAZFGR3UnzCKimwiojhuzKgDqFHEFVHABZElCq64sChKhBACkS1AEgJIku7v749zmtwUvdzqul23qvJ5Pc99quou556qdOrU2b5HEYGZmVkVJtSdATMz6x0uVMzMrDIuVMzMrDIuVMzMrDIuVMzMrDIuVMzMrDIuVMzMrDIuVMzMrDIuVMzMrDIT686AWa+SdEaZ8yLibeOdF7N2caFiNn4m1Z0Bs3aTY3+ZmVlVXFMxaxNJAp4JbA8sAa4K/6qzHuOailkbSNoeuBB4HHAXsBXwV+CVEbGkzryZVcmjv8za47PAVcAWEbE9MBP4LfC5WnNlVjHXVMzaQNJdwA4R8VBh3ybA4ojYqr6cmVXLNRWz9vgnMKNh3wxgdQ15MRs3LlTM2uP7wPcl7SNpJ0n7AOcC36s5X2aVcvOXWRtI2hj4DPAmYCPgYeBrwHuKTWJm3c6Filkb5WHFWwJ3ezhx55P0fGCriPhu3XnpFm7+MhuBpA8Ns/8DY0hrcJ7K84Bn5NfW2T4K/K8kf1eW5JqK2QgkrYyI6UPsvzcitmgiHc9T6TKS5gB/BP4OnBgRF9Wcpa7g0tdsCJIeI+kxwARJ2w6+ztvzSX0izfA8le7zBtIAi3PycyvBNRWzIUgaAIb6zyGgH/hwRJzaRHqep9JlJP0JOAq4Pm/bRkSzPyY2OI79VQNJT4+I39edDxvRXFIBcjWwR2H/AKmT/Z9Npjc4T6U40muDnaci6bgy50XESeOdl6FIejywDXB5RISkq4GX4yHgo3KhUo9LJPUDlwOXAJdExI0158kKIuKW/HSzipIcnKfyIWAxsCPwETbcL6nnFp6LNHjhDuAWYAfSF/rPa8jXoDcC5xZG6H0779tQ/71Kc/NXDST1Ac8CXgjsSxoRdAfws4g4vM682aNJejbwdGBacX9EnNJEGp6nMgxJnwKWAR8b/BLPo+tmRcRRNeXpJuCQiPhVfj0LuBl4TETcX0eeuoULlZpJ2gN4NfAeYOOI2KjkdX9n6DZ/ImLeGPIxjUd/af6j2XR6jaQTgA+SmsEeKByKiNhnDOl5nkoDScuAbSJibWHfROCOiJhVQ362Bo6LiHc27P8wcGFEXN3uPHUTFyo1kHQoqYbyQtIvtEvztqDsryBJb27YNRt4K3BWk7+gnw18FXhscTfpS7OvbDq9StIdwKsi4rd156VXSboVeEXxy1rSU0hf4NvVlzMbCxcqNcgji/4OfBj4XkT0V5TuE4FPRsR+TVxzLalf5yzW/yVe7FfYYEm6kzTqZ6DFdOYBpzN0M9rkVtLudrmp693A/7Guv+ltwOnN/EAaT5L2Bvoj4oq689LpXKjUIAcT3DdvuwBXAD8j9an8rYV0JwDLh5qsN8I19wPT3RQzNEkfJQ37PbPFdH4F3AbM59GFd50d0h1B0iHAIcB2wFLg7Ij4Wo35uRg4OSJ+IendwMdIQ8mPi4hP15WvbuBCpWaSNgfeCRwNTCvb5JQn5hVtChxGakZ4QhP3vwT4z1YKs14j6Wes66+aQBqZdANwe/G8iHhxE2muBGZGxJqq8mnjJ88rmh0RayT9GTgCWA78ICJ2rjd3nc1DimsgaRvW1VReCMwCfkVqhirrNtbvqBep6aCxr2U0lwIXSPoiaQTaIyLiG02m1SuubHhdRZPH30ihWZZWkFbXk7RNRNyRnzf+QHpEjYNFJucCZWtSQMkrASR5ouooXFOpgaS1pNFEl5IKkiuanUwnaYeGXfdHxL1jyMvNwxy
KiNip2fRsHUn/Uni5B6nA/wSPLrx/1c58dYJiTLVhohfUOlhE0h9JoXUeC8yLiNdLmgn81REQRuZCpQaSthhLAdDpJE0FXgZsD9wK/CgiVtWbq9ZIet4whx4Gbhn8tT3MtWU69zfIUXaSto+IW/Pzxh9Ij6hrsIikfUmjIh8mjf67TtKbgAMj4mV15KlbuFCpSatfwHm+w3tJw4gH0zgL+HSrI5XGQtLupMEG/awbwdMHvDgi/tzu/FRF0hpSv0oxTH3xP83PgYMiYr3+lg2JpJ2A15P6IN4paVdgYkT8peasNU3S1hFx5zDHJgG4X2xkLlRqUMUXcA73cRjwceAmUjX9fcD8iDi5ibxsDBxL6tvZksKXZzPNX7lz+0rgpBwrSTndvSPihWXT6TR5VNLLSBMgB0OInAxcDPwa+BSwKiIOHCWdzYHVEfFAYd+mwKSIWD5O2R93kl4EnEcKObR3REyXtBdwbES8pIl0dgP25tF/g22N/SVpBfBnclidiLipnffvCRHhrc0bqUA5nnWFukhzVi5tIo0bgd0a9u0KLGoyL18k/Sd6J7AqP/6V9KXQTDr3kL4gi/smAfeM4fOZChxIGhF3IDC1xn+rmxvvT5pncnN+vjVp5vdo6VwBPLNh357Az+v8W6zg8/kDsH9+fl9+3Bi4s4k03kBqZrqq8LiaFBOv3e9nI1LgyLOAO/P/jY8AT6v7s+6WzTWVGki6hxSWYk1h3yTSl9PMkmncC2w9RBp3RnOLRy0FnhsRiyQtj4jNcoTW06OJGkaOlbR/RPy9sG8X4OKImNtEOh3VjJb/reZFxD2FfbOAhRGxRdm5Qfnfa1YUmiZzDLi7m/n36jSDfzP5+SMLlzWziFkesntSRHxH0n0Rsbmkw0g/mt43frkfNV8TgL2AA/I2ETgf+AEp+kXbm5m7gRfpqsdy0pdl0Y7AyibS+BPwXw37jiaNKmvG1IhYlJ+vljQ5Iq4HntFkOl8FfiTpMEn75C+FC0mT/ZrxGdLM6jkR8VxgDvAF0kicOvyIFF34+ZLm5pnV5wI/zMf3JDWLjeafwCYN+zYFur19/lZJ682LyvHsFjeRxhygcQ34r5EmQ9YmIgYi4oqIOCoiHgu8glQj/29Sc6gNpe6q0oa4AccBC0l9Ivvkx78BxzeRxpNYFyr8F/nxDuBJTeblauBx+fkvSJO8Xg/c2mQ6fcAHSJMEH8yPHyB12DaTTmXNaBX9W00FvkwqFAby45fJTWKkdVceVyKd75BWeZyQX4tUgJ5X999ji5/P4flv+WBgBfAaUpPRIU2ksQTYLD//G2nJ5a2AlTW9py2Bo4fYf2jh/8qkduapmzY3f9UgN3u8j/RHOjhyaz5wWhQitY6Sxvak/8SNI8iaqe0g6UBS881Pc6fr94HJwH9ExJeaSGdO8SXrRkg9HMOMphkmnUqa0aqWm0JmActiDM0eedjsZcAUYBGwE6nf4AURsbjCrLadpMOBd5EK2MXAZyLirCau/zJprtZXJJ0EvIVUg/tdRPzbOGR5tPxMIE0u3j8irs37ppEmru4UEcvanadu4kKlRvkPdTqFIapRcgZxHup6MalD8YIYY1DKPFLn9shNYLlfZh4wI5qYlDfMBLZBDwPfIq0dMmKhp7Qi4MHAqazrU3kf8K2IOKGJ/Ig03HqoUW1Nh6yvQh5p93LSe7qdFEy0J9ZSUUVLJ+R/tzfktL5W1+cj6dPAQxHxwfz6IOBN0USw1g1W3VWlDXEDnk1qHuovbAOkKKhl03gscArpF9XtpIB3jx1DXv5M+vXVmPZ1TaZzGGlY6T75+n1IEQPeDuwH/A44o0Q6Exi6Ga2vyfycQmoOPI0UwPG0/PpTJa79U+H530nNO4/amszPyeTRX8CL8ntbRRqAUPvfZM1/yx332ZAWzrup8Pp84M11f97dsLmmUoMqw83nqvrLSF/qLyHNFTmT9Ct41Ka0YriMhv33R8S0oa4ZJp2/Ac+JQtOApC1JzRq75Ql
yv4hR1seQ9NeIeNwQ+6+LiCc2kZ/FpJnQ1xRGFD0LeF9E/Oso174xIr6htFDUUaRmxkf9Yo6IrzaRn1uB3SNipaRfkDqmV5KCeTY7KKJjVPG33KmfjdJCeG8k/YhYAmwXXvVxVC5USpB0OblpJypoOlHF4eYlTQH+jfQFOJe08Nck4C0RcfEo195EatdfUti3A2n+xI5N5GE5aUZ1cXLfVOC2WDfkdNSCarhzBguGJvJTjC21jDT8ur+VdFohaUVEzMgTHv9Bili8ttn8VJCPjvtb7pTPZoh8nUQaqHEN6QfKiD9GLHGU4nLmV5zeb0kTFVsKNy/paaR+g9eTZtV/HjgnIlZJOhj4CmlFyJF8Hzhb0ttJTT275HTOazI7VwDzJR1NGjQwh9QvckXO6xNpCKTY8F4Gh2hOLDwftHNOsxlLJc3JheUi4CW5cGl2CO/vJT0pcodtC+7Js8afAPw2f2lu3GKawCNBQS8jrfUxWhTk+VXcs6CKv+Vx+2xa9E3SnKnHk2piVoILlXKqrs61HG5e0tWkEUTfBl4UEb9vSOfrkj5VIqnjSUNkr2fd+zyXNMO/GW8FvkGagT6YzgJS8wGkvpLDR7j+RflxUuE5pPb5O0jNe834AvA0UrPFp0kT1kR6v824HLhQ0hmkYduPjPwq+2+VfYY0+xzgoPz4PNLn3qqvkkYAXgrsNsq5Lf8tS3pj4WUVSyeM52cDgKQfRZOBICPir/mHyLOAV1WVl17n5q8ScpMBpIiyVTQZtBxuXtIRpFpJJW28uf9jR9Iqh3e3kM5sUu1oaYlfzUNdf3pEHDnW+4+Q7nakuSVN/aKu4t+qkNYuwNqIuDm/nkdat6NtkQKq+Fse4TMpaurzGe/PRtIXIuI/xnDda0nNunVNvu06LlTMzKwyDtNiZmaVcaFiZmaVcaFiZmaVcaFiZmaVcaFiZmaVcaFiZmaVcaFiZmaVcaEyRpJmSjpBUqnlf8crDafTXel0Ul6cTvvS2aDUHSa5WzfSmiNBWr+8tjScTnel00l5cTrtS2dD2lxTMTOzyjhMyzD6pm0aE2cNH3U7BgYYWLmKCdOnoglDl81T7hx51dmIAVavfYDJEzclLYsyjDUjL4sSDLB64CEmT9gYjfQ7QRr+2GB+Bh5k8oRNRsxPrBk50G8wwGoeZjIbjZyfUfRiOp2UF6dTXTr3c9+yiNhyrPfZ7wWbxj33llu89Q/XPvzTiNh/rPcab45SPIyJszZn2xNbi2u426ceGP2kEnT7mOM7rm/SpEqSWXt76SXnh9f8Mu9mHeuSOLepxfUaLbu3n9/+dMT16x4xadubZrVyr/HmQsXMrHZBf4/80HKhYmZWswAGKl+2qR49V6hIWsC61e0OjYi9a8uMmVlJA7imYmZmFQiCNW7+6j15gtNMgElztq05N2a2oQig381fnamhuWt+k5cfSV7DfGDlqopyZGY2ul7pU/Hkx/WdDuwK7Dph+tS682JmG4gA+iNKbZ2u52oqrYiIe4B7ADaaW27MuJlZFXqjR8WFiplZ7YJwn4qZmVUjAtb0RpniQsXMrH6in5Fj83ULFypmZjULYMA1ld42ZekaHnfs7S2lsXbOmIOWrqdv1haVpKN7l1eSTt+M6ZWkU4X+FSurSahHJp410sRqgojG2pEjU1vrXFMxM7NKpMmPLlTMzKwCAayJ3pg2WNu7kPQTSe+r6/5mZp0iEP1MKLV1utpyGBEviYhPlDlX0mJJB5dNu9nzzczqNhAqtXU6N3+ZmdWsl/pU6mz+WiDp2Px8jqRzJd0h6XZJZ0ialo9dCMwBzpK0StLFkp6Rnxe3kHTAUOfX9R7NzMoR/TGh1Nbpas+hpCnAZcD1wFzg8cB2wGcBIuIVwBLgrRExNSJeHBFX5edTI2Iq8DHgRuDKoc5vIi8zJc2TNC96JhKPmXW6tPLjhFJbp+uE5q+XA4qI4/LrhyR9GPiVpMMjon+kiyUdArwL2CsilrWYl0dC36/uf6j
FpMzMyokQq6Ov7mxUohMKlbnAHEmNM/MC2AZYOtyFkvYlhavfLyJurCAvpwPfAJjct/ENFaRnZlbKQI/0qXRCoXILsDAidh/hnEe1RUl6EvAd4M0R8dvRzi+jGPp+xuStx5KEmVnTUkd95zdtldEJ7+KHwGRJH5Q0TclsSa8unHMHsMvgC0mzgR8Dx0bE+UOkud75ZmadrT0d9ZL6JJ0m6W5J90v6nqRZI5x/tKSb8rl/l/SO0e5Re6ESEQ8C+5A66P8GrAAuBZ5cOO1k4GBJ90n6CfAiYDbwiYYRYC8f5nwzs47Vxo76Y4BXAXuSBkQBnD3UiZJeCZwIHBQR04A3AadJetFIN6iz+asPWA0QEbcCw05WjIgfk2omRfObPN/MrGP1l5/Y2CdpXuH1Pbnpvoy3ASdFxCKAHNXkRkk7RMQtDefuDFwTEb8BiIhfS7oW2AP42XA3qKVQyXNQdiYNA+5Ia6dP5p4XzGkpjXteWs0IskmTNqoknX/eMbeSdHY7bmHLaQw4unBXqSLasSMdDy8Qa6L01/HWQHEg0YnACaNdJGkz0hy+Pzxy34ibJK0kFRSNhcq3gMMk7QX8GtgLmAdcNNJ92l6oSHoKsICUsQvafX8zs07TZEf9ncDehddlaynT8uOKhv3LgaHWs7gLOBe4nHVdJe+JiD+PdJO2FyoR8SdgRrvva2bWqQI10/zVHxFjaS64Pz82fv9uBgzVdPBh4A2k/u2/kvq9L5D0UER8abib1N5Rb2Zm499RHxHLSdFGnjq4T9JOpFrKtUNc8jTg+xFxfSR/AX4AvGKk+4w5h44EbGZWjQjaFfvrDOD9kuZKmg58HPhpRCwe4txfAgdI2gVA0uOAAyj0yQylEyY/VkbSjsDNwPYRcVu9uTEzKyd11LclTMupwObAVcBGpFFcBwNIOgj4vxxPEeA0UlPZz/JclnuB7+Y0htVThYqZWbdqx4z6HEvx6Lw1HjsHOKfwei1pXssxzdyjknchaRNJ50v6kaTtJF2UZ2yukHSFpKcVzj1B0qWSTpF0V95OLBzfW9JaSQfmmZwrJH2nEApfkj4q6R95ludiSUfmy6/JjzfkyZAfbvJ9rItSPOChqmbWHkG5Bbq6YZGulgsVSdsAPwf+Abwyp/l5YAdSQMg/AudJKg50fx6pw+gx+ZoP5rHQg/qAF5PGTs8DnkKKRAxpNv2bgT3zLM9nAlfmY3vkx11z2PuPNPl2jiSN/75h7T9XNXmpmdnYeTnhZHfSpJjvRsR/RER/RCyJiAsi4sGIeAg4ljThphiLa2FEfDEi1ubZmlcDT29I+5iIWBURd5JGHAweXw1MAXaXNCUi7srDlKtwOrArsOvEKVNHO9fMrBIBDMSEUlunazWH/w48QKqZACBplqSvSVqSZ2remg9tWbju9oZ0HmDdxBxI47DvHup4RCwAPkgqrO7KK0E2FkhjEhH3RMTCiFioCZ3/j2dmvUL0l9w6XavfnMcA15FGB2ye930M2JbUPDUd2D7vr+zTiIgzIuI5pOa1q4Hz8iF3hJhZ1wlgTfSV2jpdq4XKWuAg4M/AAklbkSbSPAjcJ2kqaRx0ZSQ9U9JzJW0EPEyaJTq4OuTdpILFYe/NrGtEyM1fgyJiICIOJ4Wrv4LUFLYVKR7NtcCvWPelX4WppPXrl+V7vBg4MOflIVJogW9KWi7pQxXe18xs3LRp8uO4G/M8lYjYseH1e4H35pfPbjj964XzThgirb0Lzxc05qt4TURcRiHMwBBpnQKcMmLmzcw6SFpPpfP7S8rw5MdhTLz3QTb/TmuDymb+YsvRTyqjr6J21Mn/rCSZgYeqScfGn8PNdwt1RS2kDBcqZmY1S0OKXVMxM7MKtDH217jr2vpWDgcTOYikmVlXa9Ma9ePONRUzs5ql0Pdu/jIzs4r0Sp9K59elMknbSLogRy1eCOxfODZf0lkN53sRMTPrCilKcW9Mfuymmso5pHWU5wAbA+dWfQNJM4GZAFMfiTp
jZja+UpiWzi8wyuiKQkXSbGAfYOeIWAGsyGuwXFzxrY4EjgdYHZ6LYWbtoq6ohZTRLe9iu/x4S2HfzeNwn0dC30/WlHFI3sxsaAOo1NbpuqKmAizNjzsAN+XnOxaO3w/MGnwhaSIp/lhTIuIeUjwxZkyYOZZ8mpk1rZdGf3VFTSUibgMWAJ+QNF3S1sBxhVP+ALxQ0twcvfijwKRHp2Rm1pl6paO+83O4zhuBjUiLfl0BfK1w7BzgAtLSxTeRlipe2piAmVkn6qU16rul+YuIuB14ecPu4jDit+Zt0P+Oe6bMzCoQwNouqIWU0TWFSrsFEP0tLgOz6sFK8sIWm1WSTP/0agYfTNhlx5bTuH/XGa1nBJh+6Q2VpNN/732VpIMq+mIIL2K6oemGpq0yXKiYmdWtS5q2ynChYmZWMy/SZWZmleqVmkpvNOI1kDRZ0rcl3SdpWd35MTMbyeAiXR791bleCzwTmB0RFfWWm5mNj0CsHeiN3/i9WqjsBNzkAsXMukWv9Kl0bNEo6d2S/ibpfklLJH1MUl8+FpKeUzh3b0lr8/P/Ic2231vSKknzm7jnTEnzJM0LD+k0s3YJN3+1w23AS4DFwJOBi/Lz/xvpooj4z9yP8pyI2LfJe66LUoyjFJtZewz2qfSCjq2pRMT3IuLmSP4EnA28cJxvuy5KMY5SbGbt45rKOJP0BuC9pP6RicBk4Dfjec9ilOLpjlJsZm0SiP4e6ajvyHchaXvg68DJwLYRMYMUy2uwmF4FbFq45DHtzaGZWbV6ZT2VjixUgKmkvN0NrJH0LOCQwvE/AG/O81F2JNVozMy6UvRQR31HFioR8VdSh/n5wHLgGOCbhVP+E9gZuBf4DjC/zVk0M6tUhEptna5j+1Qi4iTgpGGO/Zk0ubHofwrHTxi/nJmZVa07aiFldGyhUrsIYu2alpJYW1U49YrS0YRq/mhjo41aTmPGg9tUkBNY+IHdKknnscf8rpJ0YiAqScc2PO2oheS5fqcChwJTgIuBt0fEkOGsJG0FnEZay2oSsAh4aUT8Y7h7dGTzl5nZhiQC+gdUamvRMcCrgD2B7fK+s4c6UdIU4FJgNWmqxWbAQaSBUsNyTcXMrAM0MbKrT9K8wut78nSIMt4GnBQRiwAkvQ+4UdIOEXFLw7lvJhUk74iIwWabv4x2g46tqUj6Yg65YmbW04KmOuq3Bm4obEeWuYekzYA5pNGz6b4RNwErgT2GuOQFwN+B+ZLuyWGz/t9o9+nYmkpEHFF3HszM2qOpjvo7gb0Lr8vWUqblxxUN+5cD04c4fxapYHkP8O/Ak4CLJN0VEecMd5OOLVTMzDYkUX6MR39ELBzDLe7PjzMa9m9Gqq0Mdf7SiPhsfv17SV8n9ckMW6jU2vwlaRNJn5R0s6R7JV0kaed8bL6kswrnhqR3SLoqRy7+jaTdCscnSvqgpIWSlkv6paSn1/G+zMyaNd7zVCJiObAEeOrgPkk7kWop1w5xydWklrlHJTXSferuUzkT2A14FrAN8Fvgh5ImDXP+ocBrSNWyW0kBIAedSCpB9wdmAl8mVdU2L5uZ9ULf49D3ZtYeafTXhFJbi84A3i9prqTpwMeBn0bE4iHOnQ/MlPROSX2S9iCN/jpvpBvUVqhImgW8kTSy4M6IWE0qGLYlDXcbymkRsSQiHia94afntAS8C/iviFgUEf0R8SXgduBlTWTrSHLn12oeHsvbMjMbk4hyW4tOBS4ErgKWAn3AwQCSDpL0yHDhPBrspcBbSc1j5wInRMS3R7pBnX0qc/PjtalMeMQkYPthrrm98PwB1nU8zSLFC7tQUvFjn8S6sdhlnA58A2AyG93QxHVmZi1px+THiOgHjs5b47FzaOgriYgFwFOauUedhcrgmOhdIuLuxoOS9msirWWkQmbfiLhqrBlaL/S9thhrMmZmTQm6I65XGbU1f0XEXaRaweclzYY0jlrSqyVNbTKtAD4LfFLSLjmtqZL2k+Sw+GbW8aL
k1unq7qg/nNSHsUDS/cB1wOsY22c3GNX4fEkrSZN2jqD+92hmNrKAGFCprdPVOk8lIh4Ejs1bo0MbzlXD6wUU8h8Ra4FP5c3MrKv0SvOXJz+ORC1WcqKiYcmt5mMwmcmTK0mnCiv32KqSdCY+UM1/xItu/WMl6bz0CXtXkk7/fY2Tnseoqr9BG3cVjOzqCC5UzMxqNhj7qxe4UDEzq1sALlTMzKwqvdL81XRjfZ51eU3Jc9eL39XEPVZJenaz15mZdadyI7+6YfRX04VKRJwTEUPF3m+apL0lrR3iHlMj4tdV3MPMrCv0yESVppu/JE0qrAJmZmatit7pqB+1piJpsaTjJF2eg40dJenGwvFJOeT8DTkk/U2SXltIYiNJZ+Zw9EslvT1f9xjgJ6SlMVfl7c35WEh6TuEeb8nprpR0tqSvS5pfOD5H0rmS7pB0u6QzJE2jSY5SbGa16ZGaStnmr8OB95ICODYOoD+ZFOXydaS4/M8HigvIvJYUFXMLUhTg/8nrIf8DeAlpwZmpeftq440lPQ/4n5yHLYAfA/9WOD4FuAy4nhSk8vGkIJKfbUyrBEcpNrOaqOTW2coWKmdGxJ9yjK2HBnfmkPPvJIWcvzaS2yKiuODLZRFxQUQMRMR5pKUrn9xEHt8EfDciLouItRHxTdK6K4NeDigijouIhyLiPuDDwEGS+pq4D6QoxbsCu05moyYvNTNrwUDJrcOV7VNZPMz+LYFNWb9m0uj2htfFkPVlzAZ+37DvlsLzucAcScsbzgnSwl9Ly97IUYrNrBYb4DyV4crHu4EHgV1IARybVabcXQrs0LBvDrAoP78FWBgRu4/h/mZmHWGDnadSlJvDPg98QtITlGwn6Uklk7iD1FE/d4RzzgZeK+kFeUnLA0nLDw/6ITA5DxaYlvMwW9Krx/SmzMzqsIF11I/kQ8B3gB8A9wMLgJ3LXBgRC4EvAL/Lo8MOGeKcnwPvJq05fx+pD+UHkHrSc6TjfUgd9H8jDSS4lOb6bczM6hUqt3W4UZu/ImLHhtfzSevDD75eDZyUt8ZrDy2R3juAdzTsawxzfyZw5uBrSb8Gri4cv5W8zrKZWTdSF9RCyuiK2F953stFwGrSOisfquAlAAAR/klEQVRPJ40KG8+bor5mB4+tL/orykqL+XhEfzUZWrPXE1tOY9NbHqggJzD9ijsqSeeln3l+Jensc+WtlaRz2V6zK0mnf+X9laTjEPrjLARdEIKljK4oVIDXAGcBfcCNwKsjYiwDA8zMOpNrKu0TEW+oOw9mZuOqRwqV2tZvbwzFMobrDy2GizEz62o9MvqrK2oqZmY9bQOc/GhmZuOoV0Z/VdL8Jeldkm7OUYqXSjol799R0ndz5ODlkn4paWbh0idJuipf9xtJuxXSXCDp2Ib7DNtkJmlingC5sHCvp1fx/szMxl2PNH+1XKhImgecCrw8IqYBuwMXSNqEFD34LmA3YBZwFGlY8KBDSSO7ZgG3kgI6jtWJwKuA/YGZpMmSF0navIn3si70vYdQmlkbKcptna6KmspaUjzm3SVNjYjlEfEb0sz3jYF3R8SKHGH4NxFRHDh/WkQsiYiHSRMqx1SzyNGS30WKlrwoIvoj4kukYJYvayKpQuj7f44lK2ZmY9MjM+pbLlQiYhFwEGm9k39IulLSi4EdgUUR8ajlgguKEYybjV5cNAuYClyYm76W56jFO5HWVimrEPp+yhizYmbWpLJNX11QU6mkoz6vk3KepMnAEcD5wNuBuZL6IsY0t/x+Ulh94JGVIoezjFQo7RsRV43hXkBD6PsJM0c528ysQl1QYJRRRZ/KrpL2z30oa0gBHQP4Hqn/5NOSZuSO9Gc1sczvH4ADJG2Zr/nocCfmaMmfBT4paZecr6mS9hulMDIz6wgaKLd1uir6VCYDx5GaspaT+jZeExEPkKIHb09aa2UZcBowqWS6nwb+CtxECh75o1HOP55UQzpf0sp8zyOocYKnmVlpbv5KIuI
64F+GObYIGHJdkyEiES8o5iciVgD/2nCZCsfns3605LXAp/JmZtY1umVkVxme/DgMTZjAhOlTW0ojHniomrxsunE16Ww2o5J0Fh3ceuXvsWeXrbCO4t7GVaTHJtauqSSdS55cegT7iCZMqSaitCZUM1qoqojbNoIuGNlVhgsVM7NO4JqKmZlVxc1fZmZWjeiOkV1ldMTIKEknSLqk7nyYmdWmDaO/JPVJOk3S3Tnm4vckzSpx3X/k2IvHjnZuRxQqZmYbvPYMKT6GFCNxT9ZFGzl7pAsk7UCK23hdmRu4UDEz6wBtCij5NuDjOUbiCuB9wP654BjOl4APAfeWuUFlhYqkbSRdKGlFDj//llxd2jEfP1zSn/PxP+X4YA1J6BRJd+XtxIaDT5D001xtWyLpY5Im5WM75nsdIun6XK27WNK2Tb6HdVGK6ZEGTjPrNX2D31N5KxVTStJmwBxStBIAIuImYCWwxzDXvB14ICK+XTZzVdZUziGFZdkeeA5wSCFjhwPvJwWe3JxU6p0naefC9c8DlgCPAV4JfFDSXvn6rYCfA+cBs4FnAy8CPtCQhwNzOrNJccNOavI9rItSPOAoxWbWRuWbv7Ymf0/l7ciSdxgMkbWiYf9yYHrjyZLmAMcC7yj/JqpbpGs7UkiW/4qIlRFxF/CRwinvBk6KiGsiYiAifgxcDry+cM7CiPjiYIh8UmiWwVD4bwKuiYj/i4jVEbEU+FjeX3RiRCyLiJXAN2g+lP66KMUTHKXYzNokmor9dSf5eypvZdehGlx2pHEW9Gak2kqjs4CT8/dtaVUNKZ6dH5cU9t1SeD4X+F9Jn2u4922F18Uw+LB+KPy5wF45nP0gAX0N17QUSr8YpXjGxC2budTMrDXl+0v6I2Jh08lHLJe0BHgq6Uc7knYi1VKuHeKSFwFPkzQYzHcG8AxJ+0XEc4e7T1WFymBJNgdYVHg+6Bbg+Ij47hjTvwW4JCKaWXDLzKwriLZNfjwDeL+ky0k/oD8O/DQiFg9x7vYNr78LXAH890g3qKT5KyJuAxYAp0qaJmlLUlvcoE8DJ0h6spKNJT2nuCb9KL4GPF3SYZKmSJogaSdJ+1eRfzOz2rVnSPGpwIXAVaTKQB9wMICkgySteiQ7EbcVN+BhYGVE3DnSDarsqH8jsAmpSeuXpFIN4OGIOBP4BPAV4D5SM9mHKRkGPyLuAF4AHAAszml8n7Syo5lZdys5nLjV2kxeav3oiJgVEdMi4l8jYlk+dk5EDBtFNyL2joiTR7tHZWFaIuJ20rr0AEjaj1Sy3ZGPfxX46jDXnjDEvr0bXl9PGhU21PWLKYTFz/vmUwiNb2bW0XpkFkNlhYqkJ5M+lutIHesnA9/OqzJ2nejvZ2DFUAMimkujEg9XNLz53vsqSWbe225tOQ1NrCj0fV/jWI2xUUXpTJi9TSXpLH9GNenM+NGfK0mnf9Wq0U+ylvRKQMkqm782J80jWQVcSRpN8O4K0zcz611e+XF9EXE5sPOoJ5qZ2fq6pMAow6Hvzcw6QK80f7lQMTPrBC5UzMysKr2ySJcLFTOzurlPpTflENIzAaayWc25MbMNhWiYaNfFvEjX+taFvseh782sjXpkSLELlfWtC32PQ9+bWfu0aeXHcefmr4Ji6Pvp2qLm3JjZBqULCowyerqmIukESYvrzoeZ2YiaW6Sro/V6TWUOKSS/mVln65GaSq8XKs8BXlh3JszMRtMN/SVl9HShEhHz6s5DR1E1rZ0xUMFff1URnCtSVUTpCRWlc8delSTD1LdXMzT+vrOf2HIaW5z9+wpyAhOmbFRJOmufsksl6QDwi3NbT8OFipmZVcU1FTMzq0bgRbrMzKwawjUVMzOrkguVziRpAevWpj+0ca17M7NOpO5cef1Req5QMTPrOl0S16sMFyoFjlJsZnVxn0qHamjumt/k5UcCxwM4SrGZtVM3hGApo6djf42BoxSbWT16JPR9z9VUWuEoxWZWiy4Ja1+GCxU
zs07gQsXMzKrgyY9mZlYpVRGotQO4UDEzq1uXdMKX4UJlBJWEeO8k0TljFmNt5+SlSmuXLK0knZ2/Wc1AkYe2nl1JOssOWN1yGvu+u/U0AH531JMqSWejqxdXkk5VemVIsQsVM7NO0CO/YV2omJl1AHfUm5lZNQLokYCSXTmjXtJiSQfXnQ8zs6pooNzW6VxTMTOrWS/NUxnXmoqkd0m6WdL9kpZKOiXv/4qkW/P+6yW9seG6l+X9qyT9UNKn8zopSLoQmAOclY9fnPdPlPRBSQslLZf0S0lPH8/3Z2ZWiYjyW4cbt0JF0jzgVODlETEN2B24IB++EngysBlwEjBf0uPzdY8FzgM+ko9/GnjLYLoR8QpgCfDWiJgaES/Oh04EXgXsTwpf/2XgIkmbN5HnmZLmSZoXvbJgtJl1BUW5rdONZ01lLalWt7ukqRGxPCJ+AxARX4qIeyKiPyK+BVwL7J2vewPw24j4ZkSsjYhLgfNHupEkAe8C/isiFuV0vwTcDrysiTwfCdwA3LCah5u4zMysRW2IUiypT9Jpku7OLUXfkzRrmHNfKukyScsk3SfpCknPHe0e41aoRMQi4CDgcOAfkq6U9GJJEySdJOkGSSskLQf2ALbMl84GbmlIrvF1o1nAVODC3PS1PKe7E7BdE9kuhL7fqInLzMxa06aayjGkFp09WffdePYw525O+k7cmfT9/A3gJ5K2H+kG49pRHxHnAedJmgwcQapxvDVvLwauj4gBSb8n1WoAluZjRXMaXje2TS0DHgD2jYirWsivQ9+bWfsF0F+6xOjL3QuD7snfXWW8DTgp/+hH0vuAGyXtEBHr/XiPiHMarv2CpOOBZwC3DneD8exT2VXS/pI2AdYAK0gf3XRS09jdwARJh5FqKoO+Bewp6d9yVe0FwAENyd8B7DL4IiIC+CzwSUm75PtPlbSfpMeM01s0M6tMEzWVrcnN9Hk7slT60makH+h/GNwXETcBK1n/O3i4659IahW6bqTzxrNPZTJwHKlfYzmpz+M1wFeB3wI3kmoljweuGLwoIm4EXkfqeF8BHEWqnhU7OU4GDs7tfD/J+44n1YTOl7QS+DupdtSVc3HMbANTfvTXneRm+rydXvIO0/Ljiob9y0k/9oclaSvge8AnI+LvI507bs1fEXEd8C/DHH7dKNdewLqRYkj6JoV+lYj4MfDjhmvWAp/Km5lZV2miv6Q/IhaO4Rb358cZDfs3I9VWhpRbe34GXAx8YLSbdOTkR0mvJA07XkkavfUaYL+2Z6SDovrahqXvupsqSWfqDZMqSWfe7SP2zZbyy633rCAnMPXaRZWkE2vXVpJOJdoQ+j4ilktaAjwVuBpA0k6kWsq1Q10jaUfgUuD7EXF0mft0ZKECPI80z2QKaU7KERFxeb1ZMjMbHwJUvqO+FWcA75d0OWlQ0seBn0bE4kflSdoNuASYHxHHlr1BR/Y3RMTRETErT258fER8ue48mZmNJ0WU2lp0KnAhcBWpT7sPOBhA0kGSVhXOfT9pisd7cvSSwe2gkW7QqTUVM7MNR5tWfoyIfuDovDUeOwc4p/D634F/b/YeLlTMzGrXHXG9ynChYmbWAbohrlcZPV+oSJoUEWvqzoeZ2Yh6pKbSkR31o5G0iaRP5rD690q6SNLO+dgCSZ+R9IM8CfKoJtJ1lGIza79Io7/KbJ2uKwsV4ExgN+BZwDakGfo/lDQ4KP8w4HOkST6fayJdRyk2s3q0IUpxO3RdoZLDNL8ReEdE3BkRq0khXbYlRd4EODciLovkwSaSd5RiM6tFm4YUj7tu7FOZmx+vTcuoPGISMDjtd/FYEnaUYjOrTRcUGGV0Y6EyGANsl4i4u/GgpLfz6ND4ZmadK+iZb62ua/6KiLtIi8V8XtJsSCGdJb1a0tR6c2dm1jxRrumrG5q/uq5QyQ4ndagvkHQ/Kb7/6+iKbiwzsyEMDJTbOlw3Nn+RO9+PzVujvdubGzOzFvVQ81dXFipmHau
i5RL6H2hm0OIIVlX0TfW7xnWdmrfJ5MkVZATufPNTK0ln6188qkt27Ja3nkQ3NG2V4ULFzKwTuFAxM7NqOKCkmZlVJYAuCMFShgsVM7MO4D4VMzOrjgsVMzOrRAADLlR6jqSZwEyAqcyoOTdmtuHonY76bp1RP14c+t7M6hFRbutwLlTW59D3ZtZ+AfQPlNs6nJu/Chz63szqEZVFY6ibCxUzs07QBU1bZfRs85ekL0r6Sd35MDMb1eDorzJbh+vZmkpEHFF3HszMSuuRmkrPFipVUF9fS9dHf39FOTEbI1XTGNHq/wWo7v/Dyrmjn1PG1NetqSYhgH0rSMOFipmZVSICeuRHqAsVM7NO4JqKmZlVxoWKmZlVoztGdpXRkUOKJU2RtFLStnXnxcxs3AVEDJTaOl1HFCqS+iRtVdj1IuAvEXH7KNdtM745MzNrkx4J01JroSJpT0mfAW4DDi0cOgD4fj5nX0l/yjWXZZIuKZx3nKRFkj4qafcK8jNT0jxJ84LeqIqaWReIgIGBcluHa3uhIunxkj4i6UbgW8A/gZdExCfy8T7gFcAP8iVfAz4HzABmAycXknsn8KZ87FJJ10g6RtIOY8xeIUrxP8eYhJnZGDhKcXMkvU7S1cDPgGnAwRExNyKOiYirC6fuBSyLiIX59WrgscDWEfFwRCwYPDGSKyPiP0kFzlHAzsAfJf1S0guazGYhSvGUsbxNM7MxiYGBUluna2dNZTawE/AX4Brgb8Oc90jTV/YqYBfgOknXS3rPUBdFRH8h7RtJhcNWQ507nIi4JyIWRsRCoWYuNTNrQclaimsq60TEZ4CtgbOAVwJLJF0o6WBJ0wunHsC6pi8i4pqIOJBUQLwd+JikfQaPS9pS0hGSLgf+CjwTOBHYJiK+Pe5vzMysVT0UULKtfSoR8VBEfCciXg1sD5xH6hP5h6TDJO0BTAJ+DyBpsqQ3S5oVEQHcBwwA/fn4ccDNwH7A54FtI+KQiPhxRKxt53szMxurIMVGK7N1utomP0bECuArwFfycOItgdcB5+cCZNCBwH9LmgLcBRwfET/Px34IfDanZWbWncKLdFUqIu4C7pJ0DnB0Yf9q4KUjXPfHNmTPzGzcRRc0bZWh6JCOH0mTgWOAUzqh6UrS3cAtdefDzLrCDhGx5VgvlnQRMKvk6csiYv+x3mu8dUyhYmZm3a8jwrSYmVlvcKFiZmaVcaFiZmaVcaFiZmaVcaFiZmaVcaFiZmaVcaFiZmaVcaFiZmaVcaFiZmaV+f+MBdiOezSSbgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# This plots a chosen sentence, for which we saved the attention scores above.\n", "idx = 5\n", "src = valid_data[idx].src + [\"\"]\n", "trg = valid_data[idx].trg + [\"\"]\n", "pred = hypotheses[idx].split() + [\"\"]\n", "pred_att = alphas[idx][0].T[:, :len(pred)]\n", "print(\"src\", src)\n", "print(\"ref\", trg)\n", "print(\"pred\", pred)\n", "plot_heatmap(src, pred, pred_att)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Congratulations! You've finished this notebook.\n", "\n", "What didn't we cover?\n", "\n", "- Subwords / Byte Pair Encoding [[paper]](https://arxiv.org/abs/1508.07909) [[github]](https://github.com/rsennrich/subword-nmt) let you deal with unknown words. \n", "- You can implement a [multiplicative/bilinear attention mechanism](https://arxiv.org/abs/1508.04025) instead of the additive one used here.\n", "- We used greedy decoding here to get translations, but you can get better results with beam search.\n", "- The original model only uses a single dropout layer (in the decoder), but you can experiment with adding more dropout layers, for example on the word embeddings and the source word representations.\n", "- You can experiment with multiple encoder/decoder layers.", "- Experiment with a benchmarked and improved codebase: [Joey NMT](https://github.com/joeynmt/joeynmt)" ] }, { "metadata": {}, "cell_type": "markdown", "source": [ "If this was useful to your research, please consider citing:\n", "\n", "> J Bastings. 2018. The Annotated Encoder-Decoder with Attention. 
https://bastings.github.io/annotated_encoder_decoder/\n", "Or use the following `BibTeX`:\n", "```\n", "@misc{bastings2018annotated,\n", " title={The Annotated Encoder-Decoder with Attention},\n", " author={Bastings, J.},\n", " journal={https://bastings.github.io/annotated\\_encoder\\_decoder/},\n", " year={2018}\n", "}```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }

================================================
FILE: index.md
================================================

# The Annotated Encoder-Decoder with Attention

Recently, Alexander Rush wrote a blog post called [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html), describing the Transformer model from the paper [Attention is All You Need](https://arxiv.org/abs/1706.03762).

This post can be seen as a **prequel** to that: *we will implement an Encoder-Decoder with Attention* using (Gated) Recurrent Neural Networks, very closely following the original attention-based neural machine translation paper ["Neural Machine Translation by Jointly Learning to Align and Translate"](https://arxiv.org/abs/1409.0473) of Bahdanau et al. (2015).

The idea is that going through both blog posts will make you familiar with two very influential sequence-to-sequence architectures. If you have any comments or suggestions, please let me know: [@BastingsJasmijn](https://twitter.com/BastingsJasmijn).
[Click here to open this notebook in Google Colab.](https://colab.research.google.com/github/bastings/annotated_encoder_decoder/blob/master/annotated_encoder_decoder.ipynb)

# Model Architecture

We will model the probability $$p(Y\mid X)$$ of a target sequence $$Y=(y_1, \dots, y_{N})$$ given a source sequence $$X=(x_1, \dots, x_M)$$ directly with a neural network: an Encoder-Decoder.

#### Encoder

The encoder reads in the source sentence (*at the bottom of the figure*) and produces a sequence of hidden states $$\mathbf{h}_1, \dots, \mathbf{h}_M$$, one for each source word. These states should capture the meaning of a word in its context of the given sentence.

We will use a bi-directional recurrent neural network (Bi-RNN) as the encoder; a Bi-GRU in particular.

First of all we **embed** the source words. We simply look up the **word embedding** for each word in a (randomly initialized) lookup table. We will denote the word embedding for word $i$ in a given sentence with $\mathbf{x}_i$. By embedding words, our model may exploit the fact that certain words (e.g. *cat* and *dog*) are semantically similar, and can be processed in a similar way.

Now, how do we get hidden states $$\mathbf{h}_1, \dots, \mathbf{h}_M$$? A forward GRU reads the source sentence left-to-right, while a backward GRU reads it right-to-left. Each of them follows a simple recursive formula:

$$\mathbf{h}_j = \text{GRU}( \mathbf{x}_j , \mathbf{h}_{j - 1} )$$

i.e. we obtain the next state from the previous state and the current input word embedding.

The hidden state of the forward GRU at time step $j$ will know what words **precede** the word at that time step, but it doesn't know what words will follow. In contrast, the backward GRU will only know what words **follow** the word at time step $j$. By **concatenating** those two hidden states (*shown in blue in the figure*), we get $$\mathbf{h}_j$$, which captures word $j$ in its full sentence context.
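To make the shapes concrete, here is a tiny standalone sketch (with made-up sizes, not part of the tutorial's model): PyTorch's bidirectional GRU already returns the forward and backward states concatenated along the last dimension, so each $\mathbf{h}_j$ has twice the per-direction hidden size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up toy sizes: one sentence of 6 words with 8-dimensional embeddings.
emb_dim, hidden_size = 8, 16
birnn = nn.GRU(emb_dim, hidden_size, batch_first=True, bidirectional=True)

x = torch.randn(1, 6, emb_dim)   # [batch, time, emb_dim]
h, final = birnn(x)

# Each h_j concatenates a forward and a backward state: 2 * hidden_size.
print(h.shape)      # torch.Size([1, 6, 32])
print(final.shape)  # [num_directions, batch, hidden_size] = [2, 1, 16]
```

The first half of each output vector comes from the forward GRU and the second half from the backward GRU, which is exactly the concatenation described above.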
#### Decoder

The decoder (*at the top of the figure*) is a GRU with hidden state $\mathbf{s}_i$. It follows a similar formula to the encoder, but takes one extra input $$\mathbf{c}_{i}$$ (*shown in yellow*).

$$\mathbf{s}_{i} = f( \mathbf{s}_{i - 1}, \mathbf{y}_{i - 1}, \mathbf{c}_i )$$

Here, $$\mathbf{y}_{i - 1}$$ is the previously generated target word (*not shown*).

At each time step, an **attention mechanism** dynamically selects that part of the source sentence that is most relevant for predicting the current target word. It does so by comparing the last decoder state with each source hidden state. The result is a context vector $\mathbf{c}_i$ (*shown in yellow*). The attention mechanism is explained in more detail later.

After computing the decoder state $\mathbf{s}_i$, a non-linear function $g$ (which applies a [softmax](https://en.wikipedia.org/wiki/Softmax_function)) gives us the probability of the target word $y_i$ for this time step:

$$ p(y_i \mid y_{<i}, X) = g( \mathbf{s}_i, \mathbf{y}_{i-1}, \mathbf{c}_i )$$

# Prerequisites

This tutorial requires **PyTorch >= 0.4.1** and was tested with **Python 3.6**. Make sure you have those versions, and install the packages below if you don't have them yet.

```python
#!pip install torch numpy matplotlib sacrebleu
```

```python
%matplotlib inline
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
import matplotlib.pyplot as plt
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from IPython.core.debugger import set_trace

# we will use CUDA if it is available
USE_CUDA = torch.cuda.is_available()
DEVICE = torch.device('cuda:0')  # or set to 'cpu'
print("CUDA:", USE_CUDA)
print(DEVICE)

seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
```

    CUDA: True
    cuda:0

# Let's start coding!

## Model class

Our base model class `EncoderDecoder` is very similar to the one in *The Annotated Transformer*.
One difference is that our encoder also returns its final states (`encoder_final` below), which are used to initialize the decoder RNN. We also provide the sequence lengths, as the RNNs require those.

```python
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """
    def __init__(self, encoder, decoder, src_embed, trg_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.trg_embed = trg_embed
        self.generator = generator

    def forward(self, src, trg, src_mask, trg_mask, src_lengths, trg_lengths):
        """Take in and process masked src and target sequences."""
        encoder_hidden, encoder_final = self.encode(src, src_mask, src_lengths)
        return self.decode(encoder_hidden, encoder_final, src_mask, trg, trg_mask)

    def encode(self, src, src_mask, src_lengths):
        return self.encoder(self.src_embed(src), src_mask, src_lengths)

    def decode(self, encoder_hidden, encoder_final, src_mask, trg, trg_mask,
               decoder_hidden=None):
        return self.decoder(self.trg_embed(trg), encoder_hidden, encoder_final,
                            src_mask, trg_mask, hidden=decoder_hidden)
```

To keep things easy we also keep the `Generator` class the same. It simply projects the pre-output layer (`x` in the `forward` function below) to obtain the output layer, so that the final dimension is the target vocabulary size.

```python
class Generator(nn.Module):
    """Define standard linear + softmax generation step."""
    def __init__(self, hidden_size, vocab_size):
        super(Generator, self).__init__()
        self.proj = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)
```

## Encoder

Our encoder is a bi-directional GRU. Because we want to process multiple sentences at the same time for speed reasons (it is more efficient on GPU), we need to support **mini-batches**.
Sentences in a mini-batch may have different lengths, which means that the RNN needs to unroll further for certain sentences while it might already have finished for others:

```
Example: mini-batch with 3 source sentences of different lengths (7, 5, and 3).
End-of-sequence is marked with a "3" here, and padding positions with "1".

+---------------+
| 4 5 9 8 7 8 3 |
+---------------+
| 5 4 8 7 3 1 1 |
+---------------+
| 5 8 3 1 1 1 1 |
+---------------+
```

You can see that, when computing hidden states for this mini-batch, for sentences #2 and #3 we need to stop updating the hidden state after we have encountered "3". We don't want to incorporate the padding values (1s).

Luckily, PyTorch has convenient helper functions called `pack_padded_sequence` and `pad_packed_sequence`. These functions take care of masking and padding, so that the resulting word representations are simply zeros after a sentence stops.

The code below reads in a source sentence (a sequence of word embeddings) and produces the hidden states. It also returns a final vector, a summary of the complete sentence, by concatenating the first and the last hidden states (they have both seen the whole sentence, each in a different direction). We will use the final vector to initialize the decoder.

```python
class Encoder(nn.Module):
    """Encodes a sequence of word embeddings"""
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.):
        super(Encoder, self).__init__()
        self.num_layers = num_layers
        self.rnn = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, bidirectional=True, dropout=dropout)

    def forward(self, x, mask, lengths):
        """
        Applies a bidirectional GRU to sequence of embeddings x.
        The input mini-batch x needs to be sorted by length.
        x should have dimensions [batch, time, dim].
        """
        packed = pack_padded_sequence(x, lengths, batch_first=True)
        output, final = self.rnn(packed)
        output, _ = pad_packed_sequence(output, batch_first=True)

        # we need to manually concatenate the final states for both directions
        fwd_final = final[0:final.size(0):2]
        bwd_final = final[1:final.size(0):2]
        final = torch.cat([fwd_final, bwd_final], dim=2)  # [num_layers, batch, 2*dim]

        return output, final
```

### Decoder

The decoder is a conditional GRU. Rather than starting with an empty state like the encoder, its initial hidden state results from a projection of the encoder final vector.

#### Training

In `forward` you can find a for-loop that computes the decoder hidden states one time step at a time.

Note that, during training, we know exactly what the target words should be! (They are in `trg_embed`.) This means that we are not even checking here what the prediction is! We simply feed the correct previous target word embedding to the GRU at each time step. This is called teacher forcing.

The `forward` function returns all decoder hidden states and pre-output vectors. Elsewhere these are used to compute the loss, after which the parameters are updated.

#### Prediction

At prediction time, the `forward` function is used for a single time step at a time. After predicting a word from the returned pre-output vector, we can call it again, supplying it the word embedding of the previously predicted word and the last state.
```python
class Decoder(nn.Module):
    """A conditional RNN decoder with attention."""

    def __init__(self, emb_size, hidden_size, attention, num_layers=1, dropout=0.5,
                 bridge=True):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.attention = attention
        self.dropout = dropout

        self.rnn = nn.GRU(emb_size + 2*hidden_size, hidden_size, num_layers,
                          batch_first=True, dropout=dropout)

        # to initialize from the final encoder state
        self.bridge = nn.Linear(2*hidden_size, hidden_size, bias=True) if bridge else None

        self.dropout_layer = nn.Dropout(p=dropout)
        self.pre_output_layer = nn.Linear(hidden_size + 2*hidden_size + emb_size,
                                          hidden_size, bias=False)

    def forward_step(self, prev_embed, encoder_hidden, src_mask, proj_key, hidden):
        """Perform a single decoder step (1 word)"""

        # compute context vector using attention mechanism
        query = hidden[-1].unsqueeze(1)  # [#layers, B, D] -> [B, 1, D]
        context, attn_probs = self.attention(
            query=query, proj_key=proj_key,
            value=encoder_hidden, mask=src_mask)

        # update rnn hidden state
        rnn_input = torch.cat([prev_embed, context], dim=2)
        output, hidden = self.rnn(rnn_input, hidden)

        pre_output = torch.cat([prev_embed, output, context], dim=2)
        pre_output = self.dropout_layer(pre_output)
        pre_output = self.pre_output_layer(pre_output)

        return output, hidden, pre_output

    def forward(self, trg_embed, encoder_hidden, encoder_final,
                src_mask, trg_mask, hidden=None, max_len=None):
        """Unroll the decoder one step at a time."""

        # the maximum number of steps to unroll the RNN
        if max_len is None:
            max_len = trg_mask.size(-1)

        # initialize decoder hidden state
        if hidden is None:
            hidden = self.init_hidden(encoder_final)

        # pre-compute projected encoder hidden states
        # (the "keys" for the attention mechanism)
        # this is only done for efficiency
        proj_key = self.attention.key_layer(encoder_hidden)

        # here we store all intermediate hidden states and pre-output vectors
        decoder_states = []
        pre_output_vectors = []

        # unroll the decoder RNN for max_len steps
        for i in range(max_len):
            prev_embed = trg_embed[:, i].unsqueeze(1)
            output, hidden, pre_output = self.forward_step(
                prev_embed, encoder_hidden, src_mask, proj_key, hidden)
            decoder_states.append(output)
            pre_output_vectors.append(pre_output)

        decoder_states = torch.cat(decoder_states, dim=1)
        pre_output_vectors = torch.cat(pre_output_vectors, dim=1)
        return decoder_states, hidden, pre_output_vectors  # [B, N, D]

    def init_hidden(self, encoder_final):
        """Returns the initial decoder state,
        conditioned on the final encoder state."""

        if encoder_final is None:
            return None  # start with zeros

        return torch.tanh(self.bridge(encoder_final))
```

### Attention

At every time step, the decoder has access to *all* source word representations $$\mathbf{h}_1, \dots, \mathbf{h}_M$$. An attention mechanism allows the model to focus on the currently most relevant part of the source sentence. The state of the decoder is represented by GRU hidden state $$\mathbf{s}_i$$. So if we want to know which source word representation(s) $$\mathbf{h}_j$$ are most relevant, we need a function that takes those two things as input.

Here we use the MLP-based, additive attention that was used in Bahdanau et al.: we apply an MLP with tanh-activation to both the current decoder state $$\bf s_i$$ (the *query*) and each encoder state $$\bf h_j$$ (the *key*), and then project this to a single value (i.e. a scalar) to get the *attention energy* $$e_{ij}$$.
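In symbols, a sketch of this standard additive formulation (where $$\mathbf{v}$$, $$\mathbf{W}$$ and $$\mathbf{U}$$ are learned parameters, playing the roles of `energy_layer`, `query_layer` and `key_layer` in the code below):

$$ e_{ij} = \mathbf{v}^\top \tanh( \mathbf{W} \mathbf{s}_i + \mathbf{U} \mathbf{h}_j ) $$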
Once all energies are computed, they are normalized by a softmax so that they sum to one:

$$ \alpha_{ij} = \text{softmax}(\mathbf{e}_i)[j] $$

$$\sum_j \alpha_{ij} = 1.0$$

The context vector for time step $i$ is then a weighted sum of the encoder hidden states (the *values*):

$$\mathbf{c}_i = \sum_j \alpha_{ij} \mathbf{h}_j$$

```python
class BahdanauAttention(nn.Module):
    """Implements Bahdanau (MLP) attention"""

    def __init__(self, hidden_size, key_size=None, query_size=None):
        super(BahdanauAttention, self).__init__()

        # We assume a bi-directional encoder so key_size is 2*hidden_size
        key_size = 2 * hidden_size if key_size is None else key_size
        query_size = hidden_size if query_size is None else query_size

        self.key_layer = nn.Linear(key_size, hidden_size, bias=False)
        self.query_layer = nn.Linear(query_size, hidden_size, bias=False)
        self.energy_layer = nn.Linear(hidden_size, 1, bias=False)

        # to store attention scores
        self.alphas = None

    def forward(self, query=None, proj_key=None, value=None, mask=None):
        assert mask is not None, "mask is required"

        # We first project the query (the decoder state).
        # The projected keys (the encoder states) were already pre-computed.
        query = self.query_layer(query)

        # Calculate scores.
        scores = self.energy_layer(torch.tanh(query + proj_key))
        scores = scores.squeeze(2).unsqueeze(1)

        # Mask out invalid positions.
        # The mask marks valid positions, so we invert it using `mask == 0`.
        scores.data.masked_fill_(mask == 0, -float('inf'))

        # Turn scores to probabilities.
        alphas = F.softmax(scores, dim=-1)
        self.alphas = alphas

        # The context vector is the weighted sum of the values.
        context = torch.bmm(alphas, value)

        # context shape: [B, 1, 2D], alphas shape: [B, 1, M]
        return context, alphas
```

## Embeddings and Softmax

We use learned embeddings to convert the input tokens and output tokens to vectors of dimension `emb_size`. We will simply use PyTorch's [nn.Embedding](https://pytorch.org/docs/stable/nn.html?highlight=embedding#torch.nn.Embedding) class.
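As a minimal sketch (with toy sizes, not the real vocabulary), `nn.Embedding` maps a batch of token ids to a batch of vectors:

```python
import torch
import torch.nn as nn

vocab_size, emb_size = 10, 4  # toy sizes for illustration
embed = nn.Embedding(vocab_size, emb_size)

# a mini-batch of 2 sentences with 3 token ids each
tokens = torch.tensor([[1, 5, 2],
                       [4, 4, 0]])
vectors = embed(tokens)  # look up one emb_size-dim vector per id
print(vectors.shape)  # torch.Size([2, 3, 4])
```

Note that the same id always maps to the same vector, so the two occurrences of id 4 above get identical embeddings.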
## Full Model

Here we define a function from hyperparameters to a full model.

```python
def make_model(src_vocab, tgt_vocab, emb_size=256, hidden_size=512,
               num_layers=1, dropout=0.1):
    "Helper: Construct a model from hyperparameters."

    attention = BahdanauAttention(hidden_size)

    model = EncoderDecoder(
        Encoder(emb_size, hidden_size, num_layers=num_layers, dropout=dropout),
        Decoder(emb_size, hidden_size, attention, num_layers=num_layers,
                dropout=dropout),
        nn.Embedding(src_vocab, emb_size),
        nn.Embedding(tgt_vocab, emb_size),
        Generator(hidden_size, tgt_vocab))

    return model.cuda() if USE_CUDA else model
```

# Training

This section describes the training regime for our models.

We stop for a quick interlude to introduce some of the tools needed to train a standard encoder-decoder model. First we define a batch object that holds the src and target sentences for training, as well as their lengths and masks.

## Batches and Masking

```python
class Batch:
    """Object for holding a batch of data with mask during training.
    Input is a batch from a torch text iterator.
    """
    def __init__(self, src, trg, pad_index=0):

        src, src_lengths = src

        self.src = src
        self.src_lengths = src_lengths
        self.src_mask = (src != pad_index).unsqueeze(-2)
        self.nseqs = src.size(0)

        self.trg = None
        self.trg_y = None
        self.trg_mask = None
        self.trg_lengths = None
        self.ntokens = None

        if trg is not None:
            trg, trg_lengths = trg
            self.trg = trg[:, :-1]
            self.trg_lengths = trg_lengths
            self.trg_y = trg[:, 1:]
            self.trg_mask = (self.trg_y != pad_index)
            self.ntokens = (self.trg_y != pad_index).data.sum().item()

        if USE_CUDA:
            self.src = self.src.cuda()
            self.src_mask = self.src_mask.cuda()

            if trg is not None:
                self.trg = self.trg.cuda()
                self.trg_y = self.trg_y.cuda()
                self.trg_mask = self.trg_mask.cuda()
```

## Training Loop

The code below trains the model for 1 epoch (= 1 pass through the training data).
```python
def run_epoch(data_iter, model, loss_compute, print_every=50):
    """Standard Training and Logging Function"""

    start = time.time()
    total_tokens = 0
    total_loss = 0
    print_tokens = 0

    for i, batch in enumerate(data_iter, 1):

        out, _, pre_output = model.forward(batch.src, batch.trg,
                                           batch.src_mask, batch.trg_mask,
                                           batch.src_lengths, batch.trg_lengths)
        loss = loss_compute(pre_output, batch.trg_y, batch.nseqs)
        total_loss += loss
        total_tokens += batch.ntokens
        print_tokens += batch.ntokens

        if model.training and i % print_every == 0:
            elapsed = time.time() - start
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (i, loss / batch.nseqs, print_tokens / elapsed))
            start = time.time()
            print_tokens = 0

    return math.exp(total_loss / float(total_tokens))
```

## Training Data and Batching

We will use torch text for batching. This is discussed in more detail below.

## Optimizer

We will use the [Adam optimizer](https://arxiv.org/abs/1412.6980) with default settings ($$\beta_1=0.9$$, $$\beta_2=0.999$$ and $$\epsilon=10^{-8}$$). We will use 0.0003 as the learning rate here, but for different problems another learning rate may be more appropriate. You will have to tune that.

# A First Example

We can begin by trying out a simple copy-task. Given a random set of input symbols from a small vocabulary, the goal is to generate back those same symbols.
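Throughout, progress is measured with perplexity, which `run_epoch` above returns: the exponential of the average per-token negative log-likelihood. A tiny sketch with made-up numbers:

```python
import math

# hypothetical epoch totals: summed NLL over all tokens, and the token count
total_loss = 1385.0
total_tokens = 1000

# perplexity = exp(average negative log-likelihood per token);
# a perplexity of 1.0 would mean the model predicts every token perfectly
perplexity = math.exp(total_loss / total_tokens)
print(perplexity)
```

Lower is better, and a perplexity close to 1 on the copy task below will indicate the model has essentially solved it.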
## Synthetic Data

```python
def data_gen(num_words=11, batch_size=16, num_batches=100, length=10,
             pad_index=0, sos_index=1):
    """Generate random data for a src-tgt copy task."""
    for i in range(num_batches):
        data = torch.from_numpy(
            np.random.randint(1, num_words, size=(batch_size, length)))
        data[:, 0] = sos_index
        data = data.cuda() if USE_CUDA else data
        src = data[:, 1:]
        trg = data
        src_lengths = [length-1] * batch_size
        trg_lengths = [length] * batch_size
        yield Batch((src, src_lengths), (trg, trg_lengths), pad_index=pad_index)
```

## Loss Computation

```python
class SimpleLossCompute:
    """A simple loss compute and train function."""

    def __init__(self, generator, criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        x = self.generator(x)
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                              y.contiguous().view(-1))
        loss = loss / norm

        if self.opt is not None:
            loss.backward()
            self.opt.step()
            self.opt.zero_grad()

        return loss.data.item() * norm
```

### Printing examples

To monitor progress during training, we will translate a few examples.

We use greedy decoding for simplicity; that is, at each time step, starting at the first token, we choose the token with the maximum probability, and we never revisit that choice.
```python
def greedy_decode(model, src, src_mask, src_lengths, max_len=100,
                  sos_index=1, eos_index=None):
    """Greedily decode a sentence."""

    with torch.no_grad():
        encoder_hidden, encoder_final = model.encode(src, src_mask, src_lengths)
        prev_y = torch.ones(1, 1).fill_(sos_index).type_as(src)
        trg_mask = torch.ones_like(prev_y)

    output = []
    attention_scores = []
    hidden = None

    for i in range(max_len):
        with torch.no_grad():
            out, hidden, pre_output = model.decode(
                encoder_hidden, encoder_final, src_mask,
                prev_y, trg_mask, hidden)

            # we predict from the pre-output layer, which is
            # a combination of Decoder state, prev emb, and context
            prob = model.generator(pre_output[:, -1])

        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data.item()
        output.append(next_word)
        prev_y = torch.ones(1, 1).type_as(src).fill_(next_word)
        attention_scores.append(model.decoder.attention.alphas.cpu().numpy())

    output = np.array(output)

    # cut off everything starting from </s>
    # (only when eos_index provided)
    if eos_index is not None:
        first_eos = np.where(output == eos_index)[0]
        if len(first_eos) > 0:
            output = output[:first_eos[0]]

    return output, np.concatenate(attention_scores, axis=1)


def lookup_words(x, vocab=None):
    if vocab is not None:
        x = [vocab.itos[i] for i in x]

    return [str(t) for t in x]
```

```python
def print_examples(example_iter, model, n=2, max_len=100,
                   sos_index=1,
                   src_eos_index=None,
                   trg_eos_index=None,
                   src_vocab=None, trg_vocab=None):
    """Prints N examples. Assumes batch size of 1."""

    model.eval()
    count = 0
    print()

    if src_vocab is not None and trg_vocab is not None:
        src_eos_index = src_vocab.stoi[EOS_TOKEN]
        trg_sos_index = trg_vocab.stoi[SOS_TOKEN]
        trg_eos_index = trg_vocab.stoi[EOS_TOKEN]
    else:
        src_eos_index = None
        trg_sos_index = 1
        trg_eos_index = None

    for i, batch in enumerate(example_iter):

        src = batch.src.cpu().numpy()[0, :]
        trg = batch.trg_y.cpu().numpy()[0, :]

        # remove </s> (if it is there)
        src = src[:-1] if src[-1] == src_eos_index else src
        trg = trg[:-1] if trg[-1] == trg_eos_index else trg

        result, _ = greedy_decode(
            model, batch.src, batch.src_mask, batch.src_lengths,
            max_len=max_len, sos_index=trg_sos_index, eos_index=trg_eos_index)
        print("Example #%d" % (i+1))
        print("Src : ", " ".join(lookup_words(src, vocab=src_vocab)))
        print("Trg : ", " ".join(lookup_words(trg, vocab=trg_vocab)))
        print("Pred: ", " ".join(lookup_words(result, vocab=trg_vocab)))
        print()

        count += 1
        if count == n:
            break
```

## Training the copy task

```python
def train_copy_task():
    """Train the simple copy task."""
    num_words = 11
    criterion = nn.NLLLoss(reduction="sum", ignore_index=0)
    model = make_model(num_words, num_words, emb_size=32, hidden_size=64)
    optim = torch.optim.Adam(model.parameters(), lr=0.0003)
    eval_data = list(data_gen(num_words=num_words, batch_size=1, num_batches=100))

    dev_perplexities = []

    if USE_CUDA:
        model.cuda()

    for epoch in range(10):

        print("Epoch %d" % epoch)

        # train
        model.train()
        data = data_gen(num_words=num_words, batch_size=32, num_batches=100)
        run_epoch(data, model,
                  SimpleLossCompute(model.generator, criterion, optim))

        # evaluate
        model.eval()
        with torch.no_grad():
            perplexity = run_epoch(eval_data, model,
                                   SimpleLossCompute(model.generator, criterion, None))
            print("Evaluation perplexity: %f" % perplexity)
            dev_perplexities.append(perplexity)
            print_examples(eval_data, model, n=2, max_len=9)

    return dev_perplexities
```

```python
# train the copy task
dev_perplexities = train_copy_task()

def plot_perplexity(perplexities):
"""plot perplexities""" plt.title("Perplexity per Epoch") plt.xlabel("Epoch") plt.ylabel("Perplexity") plt.plot(perplexities) plot_perplexity(dev_perplexities) ``` /home/jb/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.1 and num_layers=1 "num_layers={}".format(dropout, num_layers)) Epoch 0 Epoch Step: 50 Loss: 19.887581 Tokens per Sec: 7748.957397 Epoch Step: 100 Loss: 17.856726 Tokens per Sec: 7925.338918 Evaluation perplexity: 7.172198 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 8 3 7 5 8 3 7 5 8 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 8 8 8 8 8 8 8 Epoch 1 Epoch Step: 50 Loss: 15.715487 Tokens per Sec: 8662.903188 Epoch Step: 100 Loss: 12.368280 Tokens per Sec: 7860.172940 Evaluation perplexity: 3.709498 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 4 8 7 5 10 8 7 5 7 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 5 6 2 6 8 2 5 Epoch 2 Epoch Step: 50 Loss: 9.246480 Tokens per Sec: 7971.095313 Epoch Step: 100 Loss: 7.701921 Tokens per Sec: 7876.198908 Evaluation perplexity: 2.303158 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 4 8 7 3 10 5 8 7 5 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 5 6 2 6 8 5 2 Epoch 3 Epoch Step: 50 Loss: 6.166847 Tokens per Sec: 8069.631171 Epoch Step: 100 Loss: 5.673258 Tokens per Sec: 7855.858586 Evaluation perplexity: 1.775795 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 4 8 7 5 10 3 7 8 5 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 3 6 5 2 8 6 8 Epoch 4 Epoch Step: 50 Loss: 4.830031 Tokens per Sec: 8094.515152 Epoch Step: 100 Loss: 4.152125 Tokens per Sec: 7999.315744 Evaluation perplexity: 1.572305 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 4 8 5 7 10 3 
7 8 5 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 3 6 5 2 8 6 2 Epoch 5 Epoch Step: 50 Loss: 3.638369 Tokens per Sec: 8112.868501 Epoch Step: 100 Loss: 3.784709 Tokens per Sec: 7843.288141 Evaluation perplexity: 1.433951 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 4 8 7 5 3 10 7 8 7 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 3 6 5 2 8 6 2 Epoch 6 Epoch Step: 50 Loss: 2.802792 Tokens per Sec: 8128.952327 Epoch Step: 100 Loss: 2.403310 Tokens per Sec: 7893.746819 Evaluation perplexity: 1.284198 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 4 8 5 7 10 3 7 8 5 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 3 6 5 2 8 6 2 Epoch 7 Epoch Step: 50 Loss: 2.174423 Tokens per Sec: 8181.341663 Epoch Step: 100 Loss: 1.838792 Tokens per Sec: 7833.160747 Evaluation perplexity: 1.173110 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 4 8 5 7 10 3 7 8 5 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 3 6 5 2 8 6 2 Epoch 8 Epoch Step: 50 Loss: 1.226522 Tokens per Sec: 8267.548130 Epoch Step: 100 Loss: 1.090876 Tokens per Sec: 7842.856308 Evaluation perplexity: 1.123090 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 4 8 5 7 10 3 7 8 5 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 3 6 5 2 8 6 2 Epoch 9 Epoch Step: 50 Loss: 1.216270 Tokens per Sec: 8181.132215 Epoch Step: 100 Loss: 0.636999 Tokens per Sec: 7866.309111 Evaluation perplexity: 1.088564 Example #1 Src : 4 8 5 7 10 3 7 8 5 Trg : 4 8 5 7 10 3 7 8 5 Pred: 4 8 5 7 10 3 7 8 5 Example #2 Src : 8 8 3 6 5 2 8 6 2 Trg : 8 8 3 6 5 2 8 6 2 Pred: 8 8 3 6 5 2 8 6 2 ![png](images/output_36_2.png) You can see that the model managed to correctly 'translate' the two examples in the end. Moreover, the perplexity of the development data nicely went down towards 1. 
# A Real World Example

Now we consider a real-world example using the IWSLT German-English Translation task. This task is much smaller than usual, but it illustrates the whole system.

The cell below installs torch text and spacy. This might take a while.

```python
#!pip install git+git://github.com/pytorch/text spacy
#!python -m spacy download en
#!python -m spacy download de
```

## Data Loading

We will load the dataset using torchtext and spacy for tokenization.

This cell might take a while to run the first time, as it will download and tokenize the IWSLT data.

For speed we only include short sentences, and we include a word in the vocabulary only if it occurs at least 5 times. In this case we also lowercase the data.

If you have **issues** with torch text in the cell below (e.g. an `ascii` error), try running `export LC_ALL="en_US.UTF-8"` before you start `jupyter notebook`.

```python
# For data loading.
from torchtext import data, datasets

if True:
    import spacy
    spacy_de = spacy.load('de')
    spacy_en = spacy.load('en')

    def tokenize_de(text):
        return [tok.text for tok in spacy_de.tokenizer(text)]

    def tokenize_en(text):
        return [tok.text for tok in spacy_en.tokenizer(text)]

UNK_TOKEN = "<unk>"
PAD_TOKEN = "<pad>"
SOS_TOKEN = "<s>"
EOS_TOKEN = "</s>"
LOWER = True

# we include lengths to provide to the RNNs
SRC = data.Field(tokenize=tokenize_de,
                 batch_first=True, lower=LOWER, include_lengths=True,
                 unk_token=UNK_TOKEN, pad_token=PAD_TOKEN,
                 init_token=None, eos_token=EOS_TOKEN)
TRG = data.Field(tokenize=tokenize_en,
                 batch_first=True, lower=LOWER, include_lengths=True,
                 unk_token=UNK_TOKEN, pad_token=PAD_TOKEN,
                 init_token=SOS_TOKEN, eos_token=EOS_TOKEN)

MAX_LEN = 25  # NOTE: we filter out a lot of sentences for speed
train_data, valid_data, test_data = datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(SRC, TRG),
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN
        and len(vars(x)['trg']) <= MAX_LEN)
MIN_FREQ = 5  # NOTE: we limit the vocabulary to frequent words for speed
SRC.build_vocab(train_data.src, min_freq=MIN_FREQ)
TRG.build_vocab(train_data.trg, min_freq=MIN_FREQ)

PAD_INDEX = TRG.vocab.stoi[PAD_TOKEN]
```

### Let's look at the data

It never hurts to look at your data and some statistics.

```python
def print_data_info(train_data, valid_data, test_data, src_field, trg_field):
    """ This prints some useful stuff about our data sets. """

    print("Data set sizes (number of sentence pairs):")
    print('train', len(train_data))
    print('valid', len(valid_data))
    print('test', len(test_data), "\n")

    print("First training example:")
    print("src:", " ".join(vars(train_data[0])['src']))
    print("trg:", " ".join(vars(train_data[0])['trg']), "\n")

    print("Most common words (src):")
    print("\n".join(["%10s %10d" % x for x in src_field.vocab.freqs.most_common(10)]), "\n")
    print("Most common words (trg):")
    print("\n".join(["%10s %10d" % x for x in trg_field.vocab.freqs.most_common(10)]), "\n")

    print("First 10 words (src):")
    print("\n".join(
        '%02d %s' % (i, t) for i, t in enumerate(src_field.vocab.itos[:10])), "\n")
    print("First 10 words (trg):")
    print("\n".join(
        '%02d %s' % (i, t) for i, t in enumerate(trg_field.vocab.itos[:10])), "\n")

    print("Number of German words (types):", len(src_field.vocab))
    print("Number of English words (types):", len(trg_field.vocab), "\n")


print_data_info(train_data, valid_data, test_data, SRC, TRG)
```

    Data set sizes (number of sentence pairs):
    train 143116
    valid 690
    test 963

    First training example:
    src: david gallo : das ist bill lange . ich bin dave gallo .
    trg: david gallo : this is bill lange . i 'm dave gallo .

    Most common words (src):
             .     138325
             ,     105944
           und      41839
           die      40809
           das      33324
           sie      33035
           ich      31153
           ist      31035
            es      27449
           wir      25817

    Most common words (trg):
             .     137259
             ,      91619
           the      73344
           and      50273
            to      42798
             a      39573
            of      39496
             i      33524
            it      32921
          that      32643

    First 10 words (src):
    00 <unk>
    01 <pad>
    02 </s>
    03 .
    04 ,
    05 und
    06 die
    07 das
    08 sie
    09 ich

    First 10 words (trg):
    00 <unk>
    01 <pad>
    02 <s>
    03 </s>
    04 .
    05 ,
    06 the
    07 and
    08 to
    09 a

    Number of German words (types): 15761
    Number of English words (types): 13003

## Iterators

Batching matters a ton for speed. We will use torch text's `BucketIterator` here to get batches containing sentences of (almost) the same length.

#### Note on sorting batches for RNNs in PyTorch

For efficiency reasons, PyTorch RNNs require that batches have been sorted by length, with the longest sentence in the batch first. For training, we simply sort each batch. For validation, we would run into trouble if we wanted to compare our translations with some external file that was not sorted. Therefore we simply set the validation batch size to 1, so that we can keep it in the original order.

```python
train_iter = data.BucketIterator(train_data, batch_size=64, train=True,
                                 sort_within_batch=True,
                                 sort_key=lambda x: (len(x.src), len(x.trg)),
                                 repeat=False, device=DEVICE)
valid_iter = data.Iterator(valid_data, batch_size=1, train=False, sort=False,
                           repeat=False, device=DEVICE)


def rebatch(pad_idx, batch):
    """Wrap torchtext batch into our own Batch class for pre-processing"""
    return Batch(batch.src, batch.trg, pad_idx)
```

## Training the System

Now we train the model.

On a Titan X GPU, this runs at ~18,000 tokens per second with a batch size of 64.
```python
def train(model, num_epochs=10, lr=0.0003, print_every=100):
    """Train a model on IWSLT"""

    if USE_CUDA:
        model.cuda()

    # optionally add label smoothing; see the Annotated Transformer
    criterion = nn.NLLLoss(reduction="sum", ignore_index=PAD_INDEX)
    optim = torch.optim.Adam(model.parameters(), lr=lr)

    dev_perplexities = []

    for epoch in range(num_epochs):

        print("Epoch", epoch)
        model.train()
        train_perplexity = run_epoch((rebatch(PAD_INDEX, b) for b in train_iter),
                                     model,
                                     SimpleLossCompute(model.generator, criterion, optim),
                                     print_every=print_every)

        model.eval()
        with torch.no_grad():
            print_examples((rebatch(PAD_INDEX, x) for x in valid_iter),
                           model, n=3, src_vocab=SRC.vocab, trg_vocab=TRG.vocab)

            dev_perplexity = run_epoch((rebatch(PAD_INDEX, b) for b in valid_iter),
                                       model,
                                       SimpleLossCompute(model.generator, criterion, None))
            print("Validation perplexity: %f" % dev_perplexity)
            dev_perplexities.append(dev_perplexity)

    return dev_perplexities
```

```python
model = make_model(len(SRC.vocab), len(TRG.vocab),
                   emb_size=256, hidden_size=256,
                   num_layers=1, dropout=0.2)
dev_perplexities = train(model, print_every=100)
```

Epoch 0
/home/jb/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1 "num_layers={}".format(dropout, num_layers))
Epoch Step: 100 Loss: 22.353386 Tokens per Sec: 16007.731248
Epoch Step: 200 Loss: 34.410126 Tokens per Sec: 16368.906298
Epoch Step: 300 Loss: 44.763870 Tokens per Sec: 16586.324787
Epoch Step: 400 Loss: 57.584606 Tokens per Sec: 16717.486756
Epoch Step: 500 Loss: 40.508701 Tokens per Sec: 16486.886104
Epoch Step: 600 Loss: 51.919121 Tokens per Sec: 16529.862635
Epoch Step: 700 Loss: 82.279633 Tokens per Sec: 16973.462052
Epoch Step: 800 Loss: 35.026432 Tokens per Sec: 16724.939524
Epoch Step: 900 Loss: 63.407204 Tokens per Sec: 16606.524355
Epoch Step: 1000
Loss: 37.909828 Tokens per Sec: 19105.497130
Epoch Step: 1100 Loss: 90.584244 Tokens per Sec: 19643.264684
...
Epoch Step: 2200 Loss: 16.793839 Tokens per Sec: 19183.702688

Example #1
Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .
Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .
Pred: when i was born years old , i was a of the of the .

Example #2
Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .
Trg : my father was listening to bbc news on his small , gray radio .
Pred: my father was on his , the of the .

Example #3
Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .
Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .
Pred: he was very interested in the way , what was pretty much more , and then it was the .

Validation perplexity: 31.839708

Epoch 1
...
Validation perplexity: 19.906190

Epoch 2
...
Validation perplexity: 15.555337

Epoch 3
...
Validation perplexity: 13.563748

Epoch 4
...
Validation perplexity: 12.664111

Epoch 5
...
Validation perplexity: 12.246438

Epoch 6
...
Validation perplexity: 12.045694

Epoch 7
...
Validation perplexity: 11.837098

Epoch 8
...
Validation perplexity: 11.868392

Epoch 9
Epoch Step: 100 Loss: 33.819195 Tokens per Sec: 16155.433696
...
Epoch Step: 2200 Loss: 37.685387 Tokens per Sec: 16498.916279

Example #1
Src : als ich 11 jahre alt war , wurde ich eines morgens von den heller freude geweckt .
Trg : when i was 11 , i remember waking up one morning to the sound of joy in my house .
Pred: when i was 11 , i was a of joy .

Example #2
Src : mein vater hörte sich auf seinem kleinen , grauen radio die der bbc an .
Trg : my father was listening to bbc news on his small , gray radio .
Pred: my father listened to his little , gray radio shack the bbc of the bbc .

Example #3
Src : er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die nachrichten meistens .
Trg : there was a big smile on his face which was unusual then , because the news mostly depressed him .
Pred: he looked very happy , which was pretty unusual since then , they were the .

Validation perplexity: 11.886973

```python
plot_perplexity(dev_perplexities)
```

![png](images/output_49_0.png)

## Prediction and Evaluation

Once trained, we can use the model to produce a set of translations. If we translate the whole validation set, we can use [SacreBLEU](https://github.com/mjpost/sacreBLEU) to compute a [BLEU score](https://en.wikipedia.org/wiki/BLEU), the most common way to evaluate translations.

#### Important sidenote

Typically you would run SacreBLEU from the **command line**, giving it the output file and the original (possibly tokenized) development reference file. This prints a version string that shows exactly how the BLEU score was calculated: for example, whether the text was lowercased, whether (and how) it was tokenized, and what smoothing was used. If you want to learn more about how BLEU scores are (and should be) reported, check out [this paper](https://arxiv.org/abs/1804.08771).

However, right now our pre-processed data exists only in memory, so we'll calculate the BLEU score right from this notebook for demonstration purposes.
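As a reminder of what BLEU computes: clipped n-gram precisions (up to 4-grams) combined as a geometric mean, multiplied by a brevity penalty. Here is a minimal pure-Python sketch of a sentence-level variant with floor smoothing; this is an illustrative toy, not the sacrebleu implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of a given order in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4, floor=0.01):
    """Toy sentence-level BLEU with floor smoothing (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # clipped matches: each hypothesis n-gram counts at most as often
        # as it appears in the reference
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(1, sum(hyp_ngrams.values()))
        # replace zero precisions by a small floor, similar in spirit
        # to the smooth value we pass to sacrebleu below
        log_prec += math.log(max(matches / total, floor)) / max_n
    # brevity penalty: punish hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / max(1, len(hyp))))
    return 100.0 * bp * math.exp(log_prec)
```

For real evaluations you should always rely on sacrebleu itself, which handles multiple references, corpus-level statistics, and tokenization.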
We'll first test the raw BLEU function:

```python
import sacrebleu
```

```python
# this should result in a perfect BLEU of 100%
hypotheses = ["this is a test"]
references = ["this is a test"]
bleu = sacrebleu.raw_corpus_bleu(hypotheses, [references], .01).score
print(bleu)
```

    100.00000000000004

```python
# here the BLEU score will be lower, because some n-grams won't match
hypotheses = ["this is a test"]
references = ["this is a fest"]
bleu = sacrebleu.raw_corpus_bleu(hypotheses, [references], .01).score
print(bleu)
```

    22.360679774997894

Since we did some filtering for speed, our validation set contains 690 sentences. The references are the tokenized versions, but they should not contain the out-of-vocabulary UNKs that our network might have seen. So we take the references straight out of the `valid_data` object:

```python
len(valid_data)
```

    690

```python
references = [" ".join(example.trg) for example in valid_data]
print(len(references))
print(references[0])
```

    690
    when i was 11 , i remember waking up one morning to the sound of joy in my house .

```python
references[-2]
```

    "i 'm always the one taking the picture ."

**Now we translate the validation set!** This might take a little bit of time. Note that `greedy_decode` will cut off the sentence when it encounters the end-of-sequence symbol, if we provide it the index of that symbol.

```python
hypotheses = []
alphas = []  # save the last attention scores
for batch in valid_iter:
    batch = rebatch(PAD_INDEX, batch)
    pred, attention = greedy_decode(
        model, batch.src, batch.src_mask, batch.src_lengths, max_len=25,
        sos_index=TRG.vocab.stoi[SOS_TOKEN],
        eos_index=TRG.vocab.stoi[EOS_TOKEN])
    hypotheses.append(pred)
    alphas.append(attention)
```

```python
# we will still need to convert the indices to actual words!
hypotheses[0]
```

    array([  70,   11,   24, 1460,    5,   11,   24,    9,    0,   10,    0,    0, 1806,    4])

```python
hypotheses = [lookup_words(x, TRG.vocab) for x in hypotheses]
hypotheses[0]
```

    ['when', 'i', 'was', '11', ',', 'i', 'was', 'a', '<unk>', 'of', '<unk>', '<unk>', 'joy', '.']

```python
# finally, the SacreBLEU raw scorer requires string input, so we convert the lists to strings
hypotheses = [" ".join(x) for x in hypotheses]
print(len(hypotheses))
print(hypotheses[0])
```

    690
    when i was 11 , i was a <unk> of <unk> <unk> joy .

```python
# now we can compute the BLEU score!
bleu = sacrebleu.raw_corpus_bleu(hypotheses, [references], .01).score
print(bleu)
```

    23.4681520210298

## Attention Visualization

We can also visualize the attention scores of the decoder.

```python
def plot_heatmap(src, trg, scores):
    fig, ax = plt.subplots()
    heatmap = ax.pcolor(scores, cmap='viridis')

    ax.set_xticklabels(trg, minor=False, rotation='vertical')
    ax.set_yticklabels(src, minor=False)

    # put the major ticks at the middle of each cell
    # and the x-ticks on top
    ax.xaxis.tick_top()
    ax.set_xticks(np.arange(scores.shape[1]) + 0.5, minor=False)
    ax.set_yticks(np.arange(scores.shape[0]) + 0.5, minor=False)
    ax.invert_yaxis()

    plt.colorbar(heatmap)
    plt.show()
```

```python
# This plots a chosen sentence, for which we saved the attention scores above.
idx = 5
src = valid_data[idx].src + ["</s>"]
trg = valid_data[idx].trg + ["</s>"]
pred = hypotheses[idx].split() + ["</s>"]
pred_att = alphas[idx][0].T[:, :len(pred)]

print("src", src)
print("ref", trg)
print("pred", pred)

plot_heatmap(src, pred, pred_att)
```

    src ['"', 'jetzt', 'kannst', 'du', 'auf', 'eine', 'richtige', 'schule', 'gehen', ',', '"', 'sagte', 'er', '.', '</s>']
    ref ['"', 'you', 'can', 'go', 'to', 'a', 'real', 'school', 'now', ',', '"', 'he', 'said', '.', '</s>']
    pred ['"', 'now', 'you', 'can', 'go', 'to', 'a', 'right', 'school', ',', '"', 'he', 'said', '.', '</s>']

![png](images/output_66_1.png)

# Congratulations! You've finished this notebook.

What didn't we cover?
- Subwords / Byte Pair Encoding [[paper]](https://arxiv.org/abs/1508.07909) [[github]](https://github.com/rsennrich/subword-nmt) let you deal with unknown words.
- You can implement a [multiplicative/bilinear attention mechanism](https://arxiv.org/abs/1508.04025) instead of the additive one used here.
- We used greedy decoding here to get translations, but you can get better results with beam search.
- The original model only uses a single dropout layer (in the decoder), but you can experiment with adding more dropout layers, for example on the word embeddings and the source word representations.
- You can experiment with multiple encoder/decoder layers.
- Experiment with a benchmarked and improved codebase: [Joey NMT](https://github.com/joeynmt/joeynmt)

If this was useful to your research, please consider citing:

> J. Bastings. 2018. The Annotated Encoder-Decoder with Attention. https://bastings.github.io/annotated_encoder_decoder/

Or use the following BibTeX:

```
@misc{bastings2018annotated,
  title={The Annotated Encoder-Decoder with Attention},
  author={Bastings, J.},
  journal={https://bastings.github.io/annotated\_encoder\_decoder/},
  year={2018}
}
```
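To make the beam-search suggestion above a little more concrete: greedy decoding keeps only the single best token at every step, while beam search keeps the `k` best partial hypotheses and can therefore recover from a locally suboptimal first choice. Here is a minimal, model-agnostic sketch; the `step_fn` interface (a function from a prefix to a dict of next-token log-probabilities) is made up for illustration and is not this notebook's model API.

```python
import math

def beam_search(step_fn, sos, eos, beam_size=3, max_len=10):
    """Minimal beam search. `step_fn(prefix)` returns a dict mapping
    candidate next tokens to log-probabilities (hypothetical interface)."""
    beams = [([sos], 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:  # finished hypotheses carry over unchanged
                candidates.append((prefix, score))
                continue
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # keep only the beam_size best-scoring prefixes
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(prefix[-1] == eos for prefix, _ in beams):
            break
    return beams[0][0]

# toy next-token distributions: the path "<s> a </s>" has the highest total score
def toy_step(prefix):
    table = {
        ("<s>",): {"a": math.log(0.6), "b": math.log(0.4)},
        ("<s>", "a"): {"</s>": math.log(0.9), "b": math.log(0.1)},
        ("<s>", "b"): {"</s>": math.log(1.0)},
    }
    return table.get(tuple(prefix), {"</s>": 0.0})

best = beam_search(toy_step, "<s>", "</s>", beam_size=2)
```

With `beam_size=1` this reduces to greedy decoding; in practice you would also length-normalize the scores, since longer hypotheses accumulate more negative log-probabilities.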