{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from IPython.display import Image"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CNTK 204: Sequence to Sequence Networks with Text Data\n",
    "\n",
    "\n",
    "## Introduction and Background\n",
    "\n",
    "This hands-on tutorial will take you through both the basics of sequence-to-sequence networks, and how to implement them in the Microsoft Cognitive Toolkit. In particular, we will implement a sequence-to-sequence model with attention to perform grapheme to phoneme translation. We will start with some basic theory and then explain the data in more detail, and how you can download it.\n",
    "\n",
    "Andrej Karpathy has a [nice visualization](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) of five common paradigms of neural network architectures:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"http://cntk.ai/jup/paradigms.jpg\" width=\"750\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Figure 1\n",
    "Image(url=\"http://cntk.ai/jup/paradigms.jpg\", width=750)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, we are going to be talking about the fourth paradigm: many-to-many where the length of the output does not necessarily equal the length of the input, also known as sequence-to-sequence networks. The input is a sequence with a dynamic length, and the output is also a sequence with some dynamic length. It is the logical extension of the many-to-one paradigm in that previously we were predicting some category (which could easily be one of `V` words where `V` is an entire vocabulary) and now we want to predict a whole sequence of those categories.\n",
    "\n",
    "The applications of sequence-to-sequence networks are nearly limitless. It is a natural fit for machine translation (e.g. English input sequences, French output sequences); automatic text summarization (e.g. full document input sequence, summary output sequence); word to pronunciation models (e.g. character [grapheme] input sequence, pronunciation [phoneme] output sequence); and even parse tree generation (e.g. regular text input, flat parse tree output).\n",
    "\n",
    "## Basic theory\n",
    "\n",
    "A sequence-to-sequence model consists of two main pieces: (1) an encoder; and (2) a decoder. Both the encoder and the decoder are recurrent neural network (RNN) layers that can be implemented using a vanilla RNN, an LSTM, or GRU Blocks (here we will use LSTM). In the basic sequence-to-sequence model, the encoder processes the input sequence into a fixed representation that is fed into the decoder as a context. The decoder then uses some mechanism (discussed below) to decode the processed information into an output sequence. The decoder is a language model that is augmented with some \"strong context\" by the encoder, and so each symbol that it generates is fed back into the decoder for additional context (like a traditional LM). For an English to German translation task, the most basic setup might look something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"http://cntk.ai/jup/s2s.png\" width=\"700\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Figure 2\n",
    "Image(url=\"http://cntk.ai/jup/s2s.png\", width=700)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The basic sequence-to-sequence network passes the information from the encoder to the decoder by initializing the decoder RNN with the final hidden state of the encoder as its initial hidden state. The input is then a \"sequence start\" tag (`<s>` in the diagram above) which primes the decoder to start generating an output sequence. Then, whatever word (or note or image, etc.) it generates at that step is fed in as the input for the next step. The decoder keeps generating outputs until it hits the special \"end sequence\" tag (`</s>` above).\n",
    "\n",
    "A more complex and powerful version of the basic sequence-to-sequence network uses an attention model. While the above setup works well, it can start to break down when the input sequences get long. At each step, the hidden state `h` is getting updated with the most recent information, and therefore `h` might be getting \"diluted\" in information as it processes each token. Further, even with a relatively short sequence, the last token will always get the last say and therefore the thought vector will be somewhat biased/weighted towards that last word. To deal with this problem, we use an \"attention\" mechanism that allows the decoder to look not only at all of the hidden states from the input, but it also learns which hidden states, for each step in decoding, to put the most weight on. In this tutorial we will implement a sequence-to-sequence network that can be run either with or without attention enabled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://cntk.ai/jup/cntk204_s2s2.png\" width=\"700\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Figure 3\n",
    "Image(url=\"https://cntk.ai/jup/cntk204_s2s2.png\", width=700)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Attention` layer above takes the current value of the hidden state in the Decoder, all of the hidden states in the Encoder, and calculates an augmented version of the hidden state to use. More specifically, the contribution from the Encoder's hidden states will represent a weighted sum of all of its hidden states where the highest weight corresponds both to the biggest contribution to the augmented hidden state and to the hidden state that will be most important for the Decoder to consider when generating the next word."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem: Grapheme-to-Phoneme Conversion\n",
    "\n",
    "The [grapheme](https://en.wikipedia.org/wiki/Grapheme) to [phoneme](https://en.wikipedia.org/wiki/Phoneme) problem is a translation task that takes the letters of a word as the input sequence (the graphemes are the smallest units of a writing system) and outputs the corresponding phonemes; that is, the units of sound that make up a language. In other words, the system aims to generate an unambigious representation of how to pronounce a given input word.\n",
    "\n",
    "**Example**\n",
    "\n",
    "The graphemes or the letters are translated into corresponding phonemes: \n",
    "\n",
    "> **Grapheme** : **|** T **|** A **|** N **|** G **|** E **|** R **|**  \n",
    "**Phonemes** : **|** ~T **|** ~AE **|** ~NG **|** ~ER **|**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Model structure overview**\n",
    "\n",
    "As discussed above, the task we are interested in solving is creating a model that takes some sequence as an input, and generates an output sequence based on the contents of the input. The model's job is to learn the mapping from the input sequence to the output sequence that it will generate. The job of the encoder is to come up with a good representation of the input that the decoder can use to generate a good output. For both the encoder and the decoder, the LSTM does a good job at this.\n",
    "\n",
    "Note that the LSTM is simply one of a whole set of different types of Blocks that can be used to implement an RNN. This is the code that is run for each step in the recurrence. In the Layers library, there are three built-in recurrent Blocks: the (vanilla) `RNN`, the `GRU`, and the `LSTM`. Each processes its input slightly differently and each has its own benefits and drawbacks for different types of tasks and networks. To get these blocks to run for each of the elements recurrently in a network, we create a `Recurrence` over them. This \"unrolls\" the network to the number of steps that are in the given input for the RNN layer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Importing CNTK and other useful libraries**\n",
    "\n",
    "CNTK is a Python module that contains several submodules like `io`, `learner`, `graph`, etc. We make extensive use of numpy as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import numpy as np\n",
    "import os\n",
    "\n",
    "import cntk as C\n",
    "import cntk.tests.test_utils\n",
    "cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)\n",
    "C.cntk_py.set_fixed_random_seed(1) # fix a random seed for CNTK components"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Check if this is a test environment\n",
    "def isTest():\n",
    "    return ('TEST_DEVICE' in os.environ) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Downloading the data\n",
    "\n",
    "In this tutorial we will use a lightly pre-processed version of the CMUDict (version 0.7b) dataset from http://www.speech.cs.cmu.edu/cgi-bin/cmudict. The CMUDict data refers to the Carnegie Mellon University Pronouncing Dictionary and is an open-source machine-readable pronunciation dictionary for North American English. The data is in the CNTKTextFormatReader format. Here is an example sequence pair from the data, where the input sequence (S0) is in the left column, and the output sequence (S1) is on the right:\n",
    "\n",
    "```\n",
    "0\t|S0 3:1  |# <s>\t  |S1 3:1 |# <s>\n",
    "0\t|S0 4:1  |# A\t    |S1 32:1 |# ~AH\n",
    "0\t|S0 5:1  |# B\t    |S1 36:1 |# ~B\n",
    "0\t|S0 4:1  |# A\t    |S1 31:1 |# ~AE\n",
    "0\t|S0 7:1  |# D\t    |S1 38:1 |# ~D\n",
    "0\t|S0 12:1 |# I\t    |S1 47:1 |# ~IY\n",
    "0\t|S0 1:1  |# </s>\t |S1 1:1 |# </s>\n",
    "```\n",
    "\n",
    "The code below will download the required files (training, testing, the single sequence above for visual validation, and a small vocab file) and put them in a local folder (the training file is ~34 MB, testing is ~4MB, and the validation file and vocab file are both less than 1KB)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Reusing locally cached: ..\\Examples\\SequenceToSequence\\CMUDict\\Data\\cmudict-0.7b.test.ctf\n",
      "Reusing locally cached: ..\\Examples\\SequenceToSequence\\CMUDict\\Data\\cmudict-0.7b.train-dev-20-21.ctf\n",
      "Reusing locally cached: ..\\Examples\\SequenceToSequence\\CMUDict\\Data\\tiny.ctf\n",
      "Reusing locally cached: ..\\Examples\\SequenceToSequence\\CMUDict\\Data\\cmudict-0.7b.mapping\n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "\n",
    "def download(url, filename):\n",
    "    \"\"\" utility function to download a file \"\"\"\n",
    "    response = requests.get(url, stream=True)\n",
    "    with open(filename, \"wb\") as handle:\n",
    "        for data in response.iter_content():\n",
    "            handle.write(data)\n",
    "\n",
    "MODEL_DIR = \".\"\n",
    "DATA_DIR = os.path.join('..', 'Examples', 'SequenceToSequence', 'CMUDict', 'Data')\n",
    "# If above directory does not exist, just use current.\n",
    "if not os.path.exists(DATA_DIR):\n",
    "    DATA_DIR = '.'\n",
    "\n",
    "dataPath = {\n",
    "  'validation': 'tiny.ctf',\n",
    "  'training': 'cmudict-0.7b.train-dev-20-21.ctf',\n",
    "  'testing': 'cmudict-0.7b.test.ctf',\n",
    "  'vocab_file': 'cmudict-0.7b.mapping',\n",
    "}\n",
    "\n",
    "for k in sorted(dataPath.keys()):\n",
    "    path = os.path.join(DATA_DIR, dataPath[k])\n",
    "    if os.path.exists(path):\n",
    "        print(\"Reusing locally cached:\", path)\n",
    "    else:\n",
    "        print(\"Starting download:\", dataPath[k])\n",
    "        url = \"https://github.com/Microsoft/CNTK/blob/release/2.6/Examples/SequenceToSequence/CMUDict/Data/%s?raw=true\"%dataPath[k]\n",
    "        download(url, path)\n",
    "        print(\"Download completed\")\n",
    "    dataPath[k] = path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Reader\n",
    "\n",
    "To efficiently collect our data, randomize it for training, and pass it to the network, we use the CNTKTextFormat reader. We will create a small function that will be called when training (or testing) that defines the names of the streams in our data, and how they are referred to in the raw training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Helper function to load the model vocabulary file\n",
    "def get_vocab(path):\n",
    "    # get the vocab for printing output sequences in plaintext\n",
    "    vocab = [w.strip() for w in open(path).readlines()]\n",
    "    i2w = { i:w for i,w in enumerate(vocab) }\n",
    "    w2i = { w:i for i,w in enumerate(vocab) }\n",
    "    \n",
    "    return (vocab, i2w, w2i)\n",
    "\n",
    "# Read vocabulary data and generate their corresponding indices\n",
    "vocab, i2w, w2i = get_vocab(dataPath['vocab_file'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Vocabulary size is 69\n",
      "First 15 letters are:\n",
      "[\"'\", '</s>', '<s/>', '<s>', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']\n",
      "\n",
      "Print dictionary with the vocabulary mapping:\n",
      "{0: \"'\", 1: '</s>', 2: '<s/>', 3: '<s>', 4: 'A', 5: 'B', 6: 'C', 7: 'D', 8: 'E', 9: 'F', 10: 'G', 11: 'H', 12: 'I', 13: 'J', 14: 'K', 15: 'L', 16: 'M', 17: 'N', 18: 'O', 19: 'P', 20: 'Q', 21: 'R', 22: 'S', 23: 'T', 24: 'U', 25: 'V', 26: 'W', 27: 'X', 28: 'Y', 29: 'Z', 30: '~AA', 31: '~AE', 32: '~AH', 33: '~AO', 34: '~AW', 35: '~AY', 36: '~B', 37: '~CH', 38: '~D', 39: '~DH', 40: '~EH', 41: '~ER', 42: '~EY', 43: '~F', 44: '~G', 45: '~HH', 46: '~IH', 47: '~IY', 48: '~JH', 49: '~K', 50: '~L', 51: '~M', 52: '~N', 53: '~NG', 54: '~OW', 55: '~OY', 56: '~P', 57: '~R', 58: '~S', 59: '~SH', 60: '~T', 61: '~TH', 62: '~UH', 63: '~UW', 64: '~V', 65: '~W', 66: '~Y', 67: '~Z', 68: '~ZH'}\n"
     ]
    }
   ],
   "source": [
    "input_vocab_dim  = 69\n",
    "label_vocab_dim  = 69\n",
    "\n",
    "# Print vocab and the correspoding mapping to the phonemes\n",
    "print(\"Vocabulary size is\", len(vocab))\n",
    "print(\"First 15 letters are:\")\n",
    "print(vocab[:15])\n",
    "print()\n",
    "print(\"Print dictionary with the vocabulary mapping:\")\n",
    "print(i2w)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the above to create a reader for our training data. Let's create it now:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def create_reader(path, is_training):\n",
    "    return C.io.MinibatchSource(C.io.CTFDeserializer(path, C.io.StreamDefs(\n",
    "        features = C.io.StreamDef(field='S0', shape=input_vocab_dim, is_sparse=True),\n",
    "        labels   = C.io.StreamDef(field='S1', shape=label_vocab_dim, is_sparse=True)\n",
    "    )), randomize = is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)\n",
    "\n",
    "# Train data reader\n",
    "train_reader = create_reader(dataPath['training'], True)\n",
    "\n",
    "# Validation data reader\n",
    "valid_reader = create_reader(dataPath['validation'], True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Set our model hyperparameters**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a number of settings that control the complexity of our network, the shapes of our inputs, and other options such as whether we will use an embedding (and what size to use), and whether or not we will employ attention. We set them now as they will be made use of when we build the network graph in the following sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "hidden_dim = 512\n",
    "num_layers = 2\n",
    "attention_dim = 128\n",
    "use_attention = True\n",
    "use_embedding = True\n",
    "embedding_dim = 200\n",
    "vocab = ([w.strip() for w in open(dataPath['vocab_file']).readlines()]) # all lines of vocab_file in a list\n",
    "length_increase = 1.5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Creation\n",
    "\n",
    "We will set two more parameters now: the symbols used to denote the start of a sequence (sometimes called 'BOS') and the end of a sequence (sometimes called 'EOS'). In this case, our sequence-start symbol is the tag $<s>$ and our sequence-end symbol is the end-tag $</s>$.\n",
    "\n",
    "Sequence start and end tags are important in sequence-to-sequence networks for two reasons. The sequence start tag is a \"primer\" for the decoder; in other words, because we are generating an output sequence and RNNs require some input, the sequence start token \"primes\" the decoder to cause it to emit its first generated token. The sequence end token is important because the decoder will learn to output this token when the sequence is finished. Otherwise the network wouldn't know how long of a sequence to generate. For the code below, we setup the sequence start symbol as a `Constant` so that it can later be passed to the Decoder LSTM as its `initial_state`. Further, we get the sequence end symbol's index so that the Decoder can use it to know when to stop generating tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sentence_start =C.Constant(np.array([w=='<s>' for w in vocab], dtype=np.float32))\n",
    "sentence_end_index = vocab.index('</s>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: setup the input to the network\n",
    "\n",
    "### Dynamic axes in CNTK (Key concept)\n",
    "\n",
    "One of the important concepts in understanding CNTK is the idea of two types of axes:\n",
    "- **static axes**, which are the traditional axes of a variable's shape, and\n",
    "- **dynamic axes**, which have dimensions that are unknown until the variable is bound to real data at computation time.\n",
    "\n",
    "The dynamic axes are particularly important in the world of recurrent neural networks. Instead of having to decide a maximum sequence length ahead of time, padding your sequences to that size, and wasting computation, CNTK's dynamic axes allow for variable sequence lengths that are automatically packed in minibatches to be as efficient as possible.\n",
    "\n",
    "When setting up sequences, there are *two dynamic axes* that are important to consider. The first is the *batch axis*, which is the axis along which multiple sequences are batched. The second is the dynamic axis particular to that sequence. The latter is specific to a particular input because of variable sequence lengths in your data. For example, in sequence to sequence networks, we have two sequences: the **input sequence**, and the **output (or 'label') sequence**. One of the things that makes this type of network so powerful is that the length of the input sequence and the output sequence do not have to correspond to each other. Therefore, both the input sequence and the output sequence require their own unique dynamic axis.\n",
    "\n",
    "We first create the `inputAxis` for the input sequence and the `labelAxis` for the output sequence. We then define the inputs to the model by creating sequences over these two unique dynamic axes. Note that `InputSequence` and `LabelSequence` are *type declarations*. This means that the `InputSequence` is a type that consists of a sequence over the `inputAxis` axis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Source and target inputs to the model\n",
    "inputAxis = C.Axis('inputAxis')\n",
    "labelAxis = C.Axis('labelAxis')\n",
    "InputSequence = C.layers.SequenceOver[inputAxis]\n",
    "LabelSequence = C.layers.SequenceOver[labelAxis]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: define the network\n",
    "\n",
    "As discussed before, the sequence-to-sequence network is, at its most basic, an RNN (LSTM) encoder followed by an RNN (LSTM) decoder, and a dense output layer. We will implement both the Encoder and the Decoder using the CNTK Layers library. Both of these will be created as CNTK Functions. Our `create_model()` Python function creates both the `encode` and `decode` CNTK Functions. The `decode` function directly makes use of the `encode` function and the return value of `create_model()` is the CNTK Function `decode` itself.\n",
    "\n",
    "We start by passing the input through an embedding (learned as part of the training process). So that this function can be used in the `Sequential` block of the Encoder and the Decoder whether we want an embedding or not, we will use the `identity` function if the `use_embedding` parameter is `False`. We then declare the Encoder layers as follows:\n",
    "\n",
    "First, we pass the input through our `embed` function and then we stabilize it. This adds an additional scalar parameter to the learning that can help our network converge more quickly during training. Then, for each of the number of LSTM layers that we want in our encoder, except the final one, we set up an LSTM recurrence. The final recurrence will be a `Fold` if we are not using attention because we only pass the final hidden state to the decoder. If we are using attention, however, then we use another normal LSTM `Recurrence` that the Decoder will put its attention over later on.\n",
    "\n",
    "Below we see a diagram of how the layered version of the sequence-to-sequence network with attention works. As the code shows below, the output of each layer of the Encoder and Decoder is used as the input to the layer just above it. The Attention model focuses on the top layer of the Encoder and informs the first layer of the Decoder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://cntk.ai/jup/cntk204_s2s3.png\" width=\"900\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Figure 4\n",
    "Image(url=\"https://cntk.ai/jup/cntk204_s2s3.png\", width=900)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the decoder, we first define several sub-layers: the `Stabilizer` for the decoder input, the `Recurrence` blocks for each of the decoder's layers, the `Stabilizer` for the output of the stack of LSTMs, and the final `Dense` output layer. If we are using attention, then we also create an `AttentionModel` function `attention_model` which returns an augmented version of the decoder's hidden state with emphasis placed on the encoder hidden states that should be most used for the given step while generating the next output token.\n",
    "\n",
    "We then build the CNTK Function `decode`. The decorator `@Function` turns a regular Python function into a proper CNTK Function with the given arguments and return value. The Decoder works differently during training than it does during test time. During training, the history (i.e. input) to the Decoder `Recurrence` consists of the ground-truth labels. This means that while generating $y^{(t=2)}$, for example, the input will be $y^{(t=1)}$. During evaluation, or \"test time\", however, the input to the Decoder will be the actual output of the model. For a greedy decoder -- which we are implementing here -- that input is therefore the `hardmax` of the final `Dense` layer.\n",
    "\n",
    "The Decoder Function `decode` takes two arguments: (1) the `input` sequence; and (2) the Decoder `history`. First, it runs the `input` sequence through the Encoder function `encode` that we setup earlier. We then get the `history` and map it to its embedding if necessary. Then the embedded representation is stabilized before running it through the Decoder's `Recurrence`. For each layer of `Recurrence`, we run the embedded `history` (now represented as `r`) through the `Recurrence`'s LSTM. If we are not using attention, we run it through the `Recurrence` with its initial state set to the value of the final hidden state of the encoder (note that since we run the Encoder backwards when not using attention that the \"final\" hidden state is actually the first hidden state in chronological time). If we are using attention, however, then we calculate the auxiliary input `h_att` using our `attention_model` function and we splice that onto the input `x`. This augmented `x` is then used as input for the Decoder's `Recurrence`.\n",
    "\n",
    "Finally, we stabilize the output of the Decoder, put it through the final `Dense` layer `proj_out`, and label the output using the `Label` layer which allows for simple access to that layer later on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create the s2s model\n",
    "def create_model(): # :: (history*, input*) -> logP(w)*\n",
    "    \n",
    "    # Embedding: (input*) --> embedded_input*\n",
    "    embed = C.layers.Embedding(embedding_dim, name='embed') if use_embedding else identity\n",
    "    \n",
    "    # Encoder: (input*) --> (h0, c0)\n",
    "    # Create multiple layers of LSTMs by passing the output of the i-th layer\n",
    "    # to the (i+1)th layer as its input\n",
    "    # Note: We go_backwards for the plain model, but forward for the attention model.\n",
    "    with C.layers.default_options(enable_self_stabilization=True, go_backwards=not use_attention):\n",
    "        LastRecurrence = C.layers.Fold if not use_attention else C.layers.Recurrence\n",
    "        encode = C.layers.Sequential([\n",
    "            embed,\n",
    "            C.layers.Stabilizer(),\n",
    "            C.layers.For(range(num_layers-1), lambda:\n",
    "                C.layers.Recurrence(C.layers.LSTM(hidden_dim))),\n",
    "            LastRecurrence(C.layers.LSTM(hidden_dim), return_full_state=True),\n",
    "            (C.layers.Label('encoded_h'), C.layers.Label('encoded_c')),\n",
    "        ])\n",
    "\n",
    "    # Decoder: (history*, input*) --> unnormalized_word_logp*\n",
    "    # where history is one of these, delayed by 1 step and <s> prepended:\n",
    "    #  - training: labels\n",
    "    #  - testing:  its own output hardmax(z) (greedy decoder)\n",
    "    with C.layers.default_options(enable_self_stabilization=True):\n",
    "        # sub-layers\n",
    "        stab_in = C.layers.Stabilizer()\n",
    "        rec_blocks = [C.layers.LSTM(hidden_dim) for i in range(num_layers)]\n",
    "        stab_out = C.layers.Stabilizer()\n",
    "        proj_out = C.layers.Dense(label_vocab_dim, name='out_proj')\n",
    "        # attention model\n",
    "        if use_attention: # maps a decoder hidden state and all the encoder states into an augmented state\n",
    "            attention_model = C.layers.AttentionModel(attention_dim, \n",
    "                                                      name='attention_model') # :: (h_enc*, h_dec) -> (h_dec augmented)\n",
    "        # layer function\n",
    "        @C.Function\n",
    "        def decode(history, input):\n",
    "            encoded_input = encode(input)\n",
    "            r = history\n",
    "            r = embed(r)\n",
    "            r = stab_in(r)\n",
    "            for i in range(num_layers):\n",
    "                rec_block = rec_blocks[i]   # LSTM(hidden_dim)  # :: (dh, dc, x) -> (h, c)\n",
    "                if use_attention:\n",
    "                    if i == 0:\n",
    "                        @C.Function\n",
    "                        def lstm_with_attention(dh, dc, x):\n",
    "                            h_att = attention_model(encoded_input.outputs[0], dh)\n",
    "                            x = C.splice(x, h_att)\n",
    "                            return rec_block(dh, dc, x)\n",
    "                        r = C.layers.Recurrence(lstm_with_attention)(r)\n",
    "                    else:\n",
    "                        r = C.layers.Recurrence(rec_block)(r)\n",
    "                else:\n",
    "                    # unlike Recurrence(), the RecurrenceFrom() layer takes the initial hidden state as a data input\n",
    "                    r = C.layers.RecurrenceFrom(rec_block)(*(encoded_input.outputs + (r,))) # :: h0, c0, r -> h                    \n",
    "            r = stab_out(r)\n",
    "            r = proj_out(r)\n",
    "            r = C.layers.Label('out_proj_out')(r)\n",
    "            return r\n",
    "\n",
    "    return decode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The network that we defined above can be thought of as an \"abstract\" model that must first be wrapped to be used. In this case, we will use it first to create a \"training\" version of the model (where the history for the Decoder will be the ground-truth labels), and then we will use it to create a greedy \"decoding\" version of the model where the history for the Decoder will be the `hardmax` output of the network. Let's set up these model wrappers next."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training\n",
    "\n",
    "Before starting training, we will define the training wrapper, the greedy decoding wrapper, and the criterion function used for training the model. Let's start with the training wrapper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def create_model_train(s2smodel):\n",
    "    # model used in training (history is known from labels)\n",
    "    # note: the labels must NOT contain the initial <s>\n",
    "    @C.Function\n",
    "    def model_train(input, labels): # (input*, labels*) --> (word_logp*)\n",
    "\n",
    "        # The input to the decoder always starts with the special label sequence start token.\n",
    "        # Then, use the previous value of the label sequence (for training) or the output (for execution).\n",
    "        past_labels = C.layers.Delay(initial_state=sentence_start)(labels)\n",
    "        return s2smodel(past_labels, input)\n",
    "    return model_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we create the CNTK Function `model_train` again using the `@Function` decorator. This function takes the input sequence `input` and the output sequence `labels` as arguments. The `past_labels` are setup as the `history` for the model we created earlier by using the `Delay` layer. This will return the previous time-step value for the input `labels` with an `initial_state` of `sentence_start`. Therefore, if we give the labels `['a', 'b', 'c']`, then `past_labels` will contain `['<s>', 'a', 'b', 'c']` and then return our abstract base model called with the history `past_labels` and the input `input`.\n",
    "\n",
    "Let's go ahead and create the greedy decoding model wrapper now as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def create_model_greedy(s2smodel):\n",
    "    # model used in (greedy) decoding (history is decoder's own output)\n",
    "    @C.Function\n",
    "    @C.layers.Signature(InputSequence[C.layers.Tensor[input_vocab_dim]])\n",
    "    def model_greedy(input): # (input*) --> (word_sequence*)\n",
    "\n",
    "        # Decoding is an unfold() operation starting from sentence_start.\n",
    "        # We must transform s2smodel (history*, input* -> word_logp*) into a generator (history* -> output*)\n",
    "        # which holds 'input' in its closure.\n",
    "        unfold = C.layers.UnfoldFrom(lambda history: s2smodel(history, input) >> C.hardmax,\n",
    "                            # stop once sentence_end_index was max-scoring output\n",
    "                            until_predicate=lambda w: w[...,sentence_end_index],\n",
    "                            length_increase=length_increase)\n",
    "        \n",
    "        return unfold(initial_state=sentence_start, dynamic_axes_like=input)\n",
    "    return model_greedy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above we create a new CNTK Function `model_greedy` which this time only takes a single argument. This is of course because when using the model at test time we don't have any labels -- it is the model's job to create them for us! In this case, we use the `UnfoldFrom` layer which runs the base model with the current `history` and funnels it into the `hardmax`. The `hardmax`'s output then becomes part of the `history` and we keep unfolding the `Recurrence` until the `sentence_end_index` has been reached. The maximum length of the output sequence (the maximum unfolding of the Decoder) is determined by a multiplier passed to `length_increase`. In this case we set `length_increase` to `1.5` above so the maximum length of each output sequence is 1.5x its input.\n",
    "\n",
    "The last thing we will do before setting up the training loop is define the function that will create the criterion function for our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def create_criterion_function(model):\n",
    "    @C.Function\n",
    "    @C.layers.Signature(input=InputSequence[C.layers.Tensor[input_vocab_dim]], \n",
    "                        labels=LabelSequence[C.layers.Tensor[label_vocab_dim]])\n",
    "    def criterion(input, labels):\n",
    "        # criterion function must drop the <s> from the labels\n",
    "        postprocessed_labels = C.sequence.slice(labels, 1, 0) # <s> A B C </s> --> A B C </s>\n",
    "        z = model(input, postprocessed_labels)\n",
    "        ce = C.cross_entropy_with_softmax(z, postprocessed_labels)\n",
    "        errs = C.classification_error(z, postprocessed_labels)\n",
    "        return (ce, errs)\n",
    "\n",
    "    return criterion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we create the criterion function which drops the sequence-start symbol from our labels for us, runs the model with the given `input` and `labels`, and uses the output to compare to our ground truth. We use the loss function `cross_entropy_with_softmax` and get the `classification_error` which gives us the percent-error per-word of our generation accuracy. The CNTK Function `criterion` returns these values as a tuple and the Python function `create_criterion_function(model)` returns that CNTK Function.\n",
    "\n",
    "Now let's move on to creating the training loop..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def train(train_reader, valid_reader, vocab, i2w, s2smodel, max_epochs, epoch_size):\n",
    "\n",
    "    # create the training wrapper for the s2smodel, as well as the criterion function\n",
    "    model_train = create_model_train(s2smodel)\n",
    "    criterion = create_criterion_function(model_train)\n",
    "\n",
    "    # also wire in a greedy decoder so that we can properly log progress on a validation example\n",
    "    # This is not used for the actual training process.\n",
    "    model_greedy = create_model_greedy(s2smodel)\n",
    "\n",
    "    # Instantiate the trainer object to drive the model training\n",
    "    minibatch_size = 72\n",
    "    lr = 0.001 if use_attention else 0.005\n",
    "    learner = C.fsadagrad(model_train.parameters,\n",
    "                          #apply the learning rate as if it is a minibatch of size 1\n",
    "                          lr = C.learning_parameter_schedule_per_sample([lr]*2+[lr/2]*3+[lr/4], epoch_size),\n",
    "                          momentum = C.momentum_schedule(0.9366416204111472, minibatch_size=minibatch_size),\n",
    "                          gradient_clipping_threshold_per_sample=2.3,\n",
    "                          gradient_clipping_with_truncation=True)\n",
    "    trainer = C.Trainer(None, criterion, learner)\n",
    "\n",
    "    # Get minibatches of sequences to train with and perform model training\n",
    "    total_samples = 0\n",
    "    mbs = 0\n",
    "    eval_freq = 100\n",
    "\n",
    "    # print out some useful training information\n",
    "    C.logging.log_number_of_parameters(model_train) ; print()\n",
    "    progress_printer = C.logging.ProgressPrinter(freq=30, tag='Training')    \n",
    "\n",
    "    # a hack to allow us to print sparse vectors\n",
    "    sparse_to_dense = create_sparse_to_dense(input_vocab_dim)\n",
    "\n",
    "    for epoch in range(max_epochs):\n",
    "        while total_samples < (epoch+1) * epoch_size:\n",
    "            # get next minibatch of training data\n",
    "            mb_train = train_reader.next_minibatch(minibatch_size)\n",
    "            \n",
    "            # do the training\n",
    "            trainer.train_minibatch({criterion.arguments[0]: mb_train[train_reader.streams.features], \n",
    "                                     criterion.arguments[1]: mb_train[train_reader.streams.labels]})\n",
    "\n",
    "            progress_printer.update_with_trainer(trainer, with_metric=True) # log progress\n",
    "\n",
    "            # every N MBs evaluate on a test sequence to visually show how we're doing\n",
    "            if mbs % eval_freq == 0:\n",
    "                mb_valid = valid_reader.next_minibatch(1)\n",
    "\n",
    "                # run an eval on the decoder output model (i.e. don't use the groundtruth)\n",
    "                e = model_greedy(mb_valid[valid_reader.streams.features])\n",
    "                print(format_sequences(sparse_to_dense(mb_valid[valid_reader.streams.features]), i2w))\n",
    "                print(\"->\")\n",
    "                print(format_sequences(e, i2w))\n",
    "\n",
    "                # visualizing attention window\n",
    "                if use_attention:\n",
    "                    debug_attention(model_greedy, mb_valid[valid_reader.streams.features])\n",
    "\n",
    "            total_samples += mb_train[train_reader.streams.labels].num_samples\n",
    "            mbs += 1\n",
    "\n",
    "        # log a summary of the stats for the epoch\n",
    "        progress_printer.epoch_summary(with_metric=True)\n",
    "\n",
    "    # done: save the final model\n",
    "    model_path = \"model_%d.cmf\" % epoch\n",
    "    print(\"Saving final model to '%s'\" % model_path)\n",
    "    s2smodel.save(model_path)\n",
    "    print(\"%d epochs complete.\" % max_epochs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above function, we created one version of the model for training (plus its associated criterion function) and one version of the model for evaluation. Normally this latter version would not be required but here we have done it so that we can periodically sample from the non-training model to visually understand how our model is converging by seeing the kinds of sequences that it generates as the training progresses.\n",
    "\n",
    "We then setup some standard variables required for the training loop. We set the `minibatch_size` (which refers to the total number of elements -- NOT sequences -- in a minibatch), the initial learning rate `lr`, we initialize a `learner` using the `adam_sgd` algorithm and a `learning_rate_schedule` that slowly reduces our learning rate. We make use of gradient clipping to help control exploding gradients, and we finally create our `Trainer` object `trainer`.\n",
    "\n",
    "We make use of CNTK's `ProgressPrinter` class which takes care of calculating average metrics per minibatch/epoch and we set it to update every 30 minibatches. And finally, before starting the training loop, we initialize a function called `sparse_to_dense` which we use to properly print out the input sequence data that we use for validation because it is sparse. That function is defined just below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# dummy for printing the input sequence below. Currently needed because input is sparse.\n",
    "def create_sparse_to_dense(input_vocab_dim):\n",
    "    I = C.Constant(np.eye(input_vocab_dim))\n",
    "    @C.Function\n",
    "    @C.layers.Signature(InputSequence[C.layers.SparseTensor[input_vocab_dim]])\n",
    "    def no_op(input):\n",
    "        return C.times(input, I)\n",
    "    return no_op"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inside the training loop, we proceed much like many other CNTK networks. We request the next bunch of minibatch data, we perform our training, and we print our progress to the screen using the `progress_printer`. Where we diverge from the norm, however, is where we run an evaluation using our `model_greedy` version of the network and run a single sequence, \"ABADI\" through to see what the network is currently predicting.\n",
    "\n",
    "Another difference in the training loop is the optional attention window visualization. Calling the function `debug_attention` shows the weight that the Decoder put on each of the Encoder's hidden states for each of the output tokens that it generated. This function, along with the `format_sequences` function required to print the input/output sequences to the screen, are given below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Given a vocab and tensor, print the output\n",
    "def format_sequences(sequences, i2w):\n",
    "    return [\" \".join([i2w[np.argmax(w)] for w in s]) for s in sequences]\n",
    "\n",
    "# to help debug the attention window\n",
    "def debug_attention(model, input):\n",
    "    q = C.combine([model, model.attention_model.attention_weights])\n",
    "    #words, p = q(input) # Python 3\n",
    "    words_p = q(input)\n",
    "    words = words_p[0]\n",
    "    p     = words_p[1]\n",
    "    output_seq_len = words[0].shape[0]\n",
    "    p_sq = np.squeeze(p[0][:output_seq_len,:,:]) # (batch, output_len, input_len, 1)\n",
    "    opts = np.get_printoptions()\n",
    "    np.set_printoptions(precision=5)\n",
    "    print(p_sq)\n",
    "    np.set_printoptions(**opts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try training our network for a small part of an epoch. In particular, we'll run through 25,000 tokens (about 3% of one epoch):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training 8347832 parameters in 29 parameter tensors.\n",
      "\n",
      "['<s> A B A D I </s>']\n",
      "->\n",
      "['O O A ~M ~R </s>']\n",
      "1\n",
      "6\n",
      "(6, 69)\n",
      "[[ 0.14166  0.1422   0.14248  0.14305  0.14337  0.14396  0.14327]\n",
      " [ 0.14166  0.1422   0.14248  0.14305  0.14337  0.14396  0.14327]\n",
      " [ 0.14166  0.1422   0.14248  0.14305  0.14337  0.14396  0.14328]\n",
      " [ 0.14166  0.1422   0.14248  0.14305  0.14337  0.14396  0.14328]\n",
      " [ 0.14166  0.1422   0.14248  0.14305  0.14337  0.14395  0.14327]\n",
      " [ 0.14166  0.1422   0.14248  0.14305  0.14337  0.14396  0.14327]]\n",
      " Minibatch[   1-  30]: loss = 4.145851 * 1601, metric = 87.38% * 1601;\n",
      " Minibatch[  31-  60]: loss = 3.648529 * 1601, metric = 86.45% * 1601;\n",
      " Minibatch[  61-  90]: loss = 3.320425 * 1548, metric = 88.44% * 1548;\n",
      "['<s> A B A D I </s>']\n",
      "->\n",
      "['~N ~N </s>']\n",
      "1\n",
      "3\n",
      "(3, 69)\n",
      "[[ 0.1419   0.14222  0.14241  0.14298  0.14331  0.14392  0.14326]\n",
      " [ 0.1419   0.14223  0.14241  0.14298  0.14331  0.14391  0.14326]\n",
      " [ 0.1419   0.14223  0.14241  0.14298  0.14331  0.14391  0.14326]]\n",
      " Minibatch[  91- 120]: loss = 3.232078 * 1567, metric = 86.02% * 1567;\n",
      " Minibatch[ 121- 150]: loss = 3.212716 * 1580, metric = 83.61% * 1580;\n",
      " Minibatch[ 151- 180]: loss = 3.215528 * 1544, metric = 84.26% * 1544;\n",
      "['<s> A B A D I </s>']\n",
      "->\n",
      "['~R ~R ~AH ~AH ~AH </s>']\n",
      "1\n",
      "6\n",
      "(6, 69)\n",
      "[[ 0.14147  0.142    0.14236  0.14305  0.14347  0.14414  0.14352]\n",
      " [ 0.14147  0.142    0.14236  0.14305  0.14347  0.14413  0.14352]\n",
      " [ 0.14147  0.142    0.14236  0.14305  0.14347  0.14413  0.14352]\n",
      " [ 0.14147  0.142    0.14236  0.14305  0.14347  0.14413  0.14352]\n",
      " [ 0.14147  0.142    0.14236  0.14305  0.14347  0.14413  0.14352]\n",
      " [ 0.14147  0.142    0.14236  0.14305  0.14347  0.14413  0.14352]]\n",
      " Minibatch[ 181- 210]: loss = 3.145194 * 1565, metric = 82.81% * 1565;\n",
      " Minibatch[ 211- 240]: loss = 3.186279 * 1583, metric = 83.26% * 1583;\n",
      " Minibatch[ 241- 270]: loss = 3.127671 * 1562, metric = 83.10% * 1562;\n",
      " Minibatch[ 271- 300]: loss = 3.152009 * 1551, metric = 83.69% * 1551;\n",
      "['<s> A B A D I </s>']\n",
      "->\n",
      "['~R ~R ~R ~AH </s>']\n",
      "1\n",
      "5\n",
      "(5, 69)\n",
      "[[ 0.1412   0.14181  0.14228  0.14308  0.14358  0.14431  0.14373]\n",
      " [ 0.14121  0.14182  0.14228  0.14308  0.14358  0.14431  0.14373]\n",
      " [ 0.14121  0.14181  0.14228  0.14308  0.14358  0.14431  0.14373]\n",
      " [ 0.14121  0.14181  0.14228  0.14308  0.14358  0.14431  0.14373]\n",
      " [ 0.14121  0.14181  0.14228  0.14308  0.14358  0.14431  0.14373]]\n",
      " Minibatch[ 301- 330]: loss = 3.134388 * 1575, metric = 82.92% * 1575;\n",
      " Minibatch[ 331- 360]: loss = 3.098724 * 1569, metric = 82.86% * 1569;\n",
      " Minibatch[ 361- 390]: loss = 3.104150 * 1567, metric = 82.45% * 1567;\n",
      "['<s> A B A D I </s>']\n",
      "->\n",
      "['~K ~R ~R ~AH </s>']\n",
      "1\n",
      "5\n",
      "(5, 69)\n",
      "[[ 0.14097  0.14161  0.14217  0.14309  0.14369  0.14451  0.14396]\n",
      " [ 0.14099  0.14162  0.14218  0.14309  0.14368  0.14449  0.14395]\n",
      " [ 0.14098  0.14162  0.14218  0.14309  0.14368  0.1445   0.14395]\n",
      " [ 0.14098  0.14162  0.14218  0.14309  0.14369  0.1445   0.14396]\n",
      " [ 0.14098  0.14162  0.14218  0.14309  0.14368  0.1445   0.14396]]\n",
      " Minibatch[ 391- 420]: loss = 3.123781 * 1601, metric = 82.26% * 1601;\n",
      "Finished Epoch[1]: [Training] loss = 3.276010 * 22067, metric = 84.25% * 22067 47.957s (460.1 samples/s);\n",
      "Saving final model to 'model_0.cmf'\n",
      "1 epochs complete.\n"
     ]
    }
   ],
   "source": [
    "model = create_model()\n",
    "train(train_reader, valid_reader, vocab, i2w, model, max_epochs=1, epoch_size=25000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see above, while the loss has come down quite a ways, the output sequence is still quite a ways off from what we expect. Uncomment the code below to run for a full epoch (notice that we switch the `epoch_size` parameter to the actual size of the training data) and by the end of the first epoch you will already see a very good grapheme-to-phoneme translation model running!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Uncomment the line below to train the model for a full epoch\n",
    "#train(train_reader, valid_reader, vocab, i2w, model, max_epochs=1, epoch_size=908241)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Testing the network\n",
    "\n",
    "Now that we've trained a sequence-to-sequence network for graphme-to-phoneme translation, there are two important things we should do with it. First, we should test its accuracy on a held-out test set. Then, we should try it out in an interactive environment so that we can put in our own input sequences and see what the model predicts. Let's start by determining the test string error rate.\n",
    "\n",
    "At the end of training, we saved the model using the line `s2smodel.save(model_path)`. Therefore, to test it, we will need to first `load` that model and then run some test data through it. Let's `load` the model, then create a reader configured to access our testing data. Note that we pass `False` to the `create_reader` function this time to denote that we are in testing mode so we should only pass over the data a single time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# load the model for epoch 0\n",
    "model_path = \"model_0.cmf\"\n",
    "model = C.Function.load(model_path)\n",
    "\n",
    "# create a reader pointing at our testing data\n",
    "test_reader = create_reader(dataPath['testing'], False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to define our testing function. We pass the `reader`, the learned `s2smodel`, and the vocabulary map `i2w` so that we can directly compare the model's predictions to the test set labels. We loop over the test set, evaluate the model on minibatches of size 512 for efficiency, and keep track of the error rate. Note that below we test *per-sequence*. This means that every single token in a generated sequence must match the tokens in the label for that sequence to be considered as correct."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# This decodes the test set and counts the string error rate.\n",
    "def evaluate_decoding(reader, s2smodel, i2w):\n",
    "    \n",
    "    model_decoding = create_model_greedy(s2smodel) # wrap the greedy decoder around the model\n",
    "\n",
    "    progress_printer = C.logging.ProgressPrinter(tag='Evaluation')\n",
    "\n",
    "    sparse_to_dense = create_sparse_to_dense(input_vocab_dim)\n",
    "\n",
    "    minibatch_size = 512\n",
    "    num_total = 0\n",
    "    num_wrong = 0\n",
    "    while True:\n",
    "        mb = reader.next_minibatch(minibatch_size)\n",
    "        if not mb: # finish when end of test set reached\n",
    "            break\n",
    "        e = model_decoding(mb[reader.streams.features])\n",
    "        outputs = format_sequences(e, i2w)\n",
    "        labels  = format_sequences(sparse_to_dense(mb[reader.streams.labels]), i2w)\n",
    "        # prepend sentence start for comparison\n",
    "        outputs = [\"<s> \" + output for output in outputs]\n",
    "\n",
    "        num_total += len(outputs)\n",
    "        num_wrong += sum([label != output for output, label in zip(outputs, labels)])\n",
    "        \n",
    "    rate = num_wrong / num_total\n",
    "    print(\"string error rate of {:.1f}% in {} samples\".format(100 * rate, num_total))\n",
    "    return rate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will evaluate the decoding using the above function. If you use the version of the model we trained above with just a small 50000 sample of the training data, you will get an error rate of 100% because we cannot possibly get every single token correct with such a small amount of training. However, if you uncommented the training line above that trains the network for a full epoch, you should have ended up with a much-improved model that showed approximately the following training statistics:\n",
    "\n",
    "```\n",
    "Finished Epoch[1 of 300]: [Training] loss = 0.878420 * 799303, metric = 26.23% * 799303 1755.985s (455.2 samples/s);\n",
    "```\n",
    "\n",
    "Now let's evaluate the model's test set performance below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "string error rate of 100.0% in 12855 samples\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the string error rate\n",
    "evaluate_decoding(test_reader, model, i2w)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "If you did not run the training for the full first epoch, the output above will be a `1.0` meaning 100% string error rate. If, however, you uncommented the line to perform training for a full epoch, you should get an output of `0.569`. A string error rate of `56.9` is actually not bad for a single pass over the data. Let's now modify the above `evaluate_decoding` function to output the per-phoneme error rate. This means that we are calculating the error at a higher precision and also makes things easier in some sense because with the string error rate we could have every phoneme correct but one in each example and still end up with a 100% error rate. Here is the modified version of that function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# This decodes the test set and counts the string error rate.\n",
    "def evaluate_decoding(reader, s2smodel, i2w):\n",
    "    \n",
    "    model_decoding = create_model_greedy(s2smodel) # wrap the greedy decoder around the model\n",
    "\n",
    "    progress_printer = C.logging.ProgressPrinter(tag='Evaluation')\n",
    "\n",
    "    sparse_to_dense = create_sparse_to_dense(input_vocab_dim)\n",
    "\n",
    "    minibatch_size = 512\n",
    "    num_total = 0\n",
    "    num_wrong = 0\n",
    "    while True:\n",
    "        mb = reader.next_minibatch(minibatch_size)\n",
    "        if not mb: # finish when end of test set reached\n",
    "            break\n",
    "        e = model_decoding(mb[reader.streams.features])\n",
    "        outputs = format_sequences(e, i2w)\n",
    "        labels  = format_sequences(sparse_to_dense(mb[reader.streams.labels]), i2w)\n",
    "        # prepend sentence start for comparison\n",
    "        outputs = [\"<s> \" + output for output in outputs]\n",
    "        \n",
    "        for s in range(len(labels)):\n",
    "            for w in range(len(labels[s])):\n",
    "                num_total += 1\n",
    "                if w < len(outputs[s]): # in case the prediction is longer than the label\n",
    "                    if outputs[s][w] != labels[s][w]:\n",
    "                        num_wrong += 1\n",
    "                \n",
    "    rate = num_wrong / num_total\n",
    "    print(\"{:.1f}\".format(100 * rate))\n",
    "    return rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "45.0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.45012815036309267"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print the phoneme error rate\n",
    "test_reader = create_reader(dataPath['testing'], False)\n",
    "evaluate_decoding(test_reader, model, i2w)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're using the model that was trained for one full epoch, then you should get a phoneme error rate of around 10%. Not bad! This means that for each of the 383,294 phonemes in the test set, our model predicted nearly 90% of them correctly (if you used the quickly-trained version of the model then you will get an error rate of around 45%). Now, let's work with an interactive session where we can input our own input sequences and see how the model predicts their pronunciation (i.e. phonemes). Additionally, we will visualize the Decoder's attention for these samples to see which graphemes in the input it deemed to be important for each phoneme that it produces. Note that in the examples below the results will only be good if you use a model that has been trained for at least one epoch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Interactive session\n",
    "\n",
    "Here we will write an interactive function to make it easy to interact with the trained model and try out your own input sequences that do not appear in the test set. Please note that the results will be very poor if you just use the model that was trained for a very short amount of time. The model we used just above that was trained for one epoch does a good job, and if you have the time and patience to train the model for a full 30 epochs, it will perform very nicely.\n",
    "\n",
    "We will first import some graphics libraries that make the attention visualization possible and then we will define the `translate` function that takes a numpy-based representation of the input and runs our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# imports required for showing the attention weight heatmap\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import pandas as pd\n",
    "\n",
    "def translate(tokens, model_decoding, vocab, i2w, show_attention=False):\n",
    "\n",
    "    vdict = {v:i for i,v in enumerate(vocab)}\n",
    "    try:\n",
    "        w = [vdict[\"<s>\"]] + [vdict[c] for c in tokens] + [vdict[\"</s>\"]]\n",
    "    except:\n",
    "        print('Input contains an unexpected token.')\n",
    "        return []\n",
    "\n",
    "    # convert to one_hot\n",
    "    query = C.Value.one_hot([w], len(vdict))\n",
    "    pred = model_decoding(query)\n",
    "    pred = pred[0] # first sequence (we only have one) -> [len, vocab size]\n",
    "    if use_attention:\n",
    "        pred = np.squeeze(pred) # attention has extra dimensions\n",
    "\n",
    "    # print out translation and stop at the sequence-end tag\n",
    "    prediction = np.argmax(pred, axis=-1)\n",
    "    translation = [i2w[i] for i in prediction]\n",
    "    \n",
    "    # show attention window (requires matplotlib, seaborn, and pandas)\n",
    "    if use_attention and show_attention:    \n",
    "        q = C.combine([model_decoding.attention_model.attention_weights])\n",
    "        att_value = q(query)\n",
    "\n",
    "        # get the attention data up to the length of the output (subset of the full window)\n",
    "        att_value = np.squeeze(att_value[0][0:len(prediction),0:len(w)])\n",
    "\n",
    "        # set up the actual words/letters for the heatmap axis labels\n",
    "        columns = [i2w[ww] for ww in prediction]\n",
    "        index = [i2w[ww] for ww in w]\n",
    "\n",
    "        dframe = pd.DataFrame(data=np.fliplr(att_value.T), columns=columns, index=index)\n",
    "        sns.heatmap(dframe)\n",
    "        plt.show()\n",
    "\n",
    "    return translation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `translate` function above takes a list of letters input by the user as `tokens`, the greedy decoding version of our model `model_decoding`, the vocabulary `vocab`, a map of index to vocab `i2w`, and the `show_attention` option which determines if we will visualize the attention vectors or not.\n",
    "\n",
    "We convert our input into a `one_hot` representation, run it through the model with `model_decoding(query)` and, since each prediction is actually a probability distribution over the entire vocabulary, we take the `argmax` to get the most probable token for each step.\n",
    "\n",
    "To visualize the attention window, we use `combine` to turn the `attention_weights` into a CNTK Function that takes the inputs that we expect. This way, when we run the function `q`, the output will be the values of the `attention_weights`. We do some data manipulation to get this data into the format that `sns` expects, and we show the visualization.\n",
    "\n",
    "Finally, we need to write the user-interaction loop which allows a user to enter multiple inputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def interactive_session(s2smodel, vocab, i2w, show_attention=False):\n",
    "\n",
    "    model_decoding = create_model_greedy(s2smodel) # wrap the greedy decoder around the model\n",
    "\n",
    "    import sys\n",
    "\n",
    "    print('Enter one or more words to see their phonetic transcription.')\n",
    "    while True:\n",
    "        if isTest(): # Testing a prefilled text for routine testing\n",
    "            line = \"psychology\"\n",
    "        else:    \n",
    "            line = input(\"> \")\n",
    "        if line.lower() == \"quit\":\n",
    "            break\n",
    "        # tokenize. Our task is letter to sound.\n",
    "        out_line = []\n",
    "        for word in line.split():\n",
    "            in_tokens = [c.upper() for c in word]\n",
    "            out_tokens = translate(in_tokens, model_decoding, vocab, i2w, show_attention=True)\n",
    "            out_line.extend(out_tokens)\n",
    "        out_line = [\" \" if tok == '</s>' else tok[1:] for tok in out_line]\n",
    "        print(\"=\", \" \".join(out_line))\n",
    "        sys.stdout.flush()\n",
    "        if isTest(): #If test environment we will test the translation only once\n",
    "            break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above function simply creates a greedy decoder around our model and then continually asks the user for an input which we pass to our `translate` function. Visualizations of the attention will continue being appended to the notebook until you exit the loop by typing `quit`. Please uncomment the following line to try out the interaction session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enter one or more words to see their phonetic transcription.\n",
      "> blah\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAdQAAAFRCAYAAADNWlwHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X9QVPf97/HXkhWpgq6aqoyXiahlmuitF0k0MxGCTDRp\nzNfYL5AgFY3GBDGZ3kFijdEJpC0Sa7FWDVZqNepoMFKbSYh1orWuSq0/8NdsGkzjT4qGqthGcO2u\ncu4fmeyVfiNL9pwVd30+nJ1hf519H2J87ft9PnvWZhiGIQAAYEpERxcAAEA4IFABALAAgQoAgAUI\nVAAALECgAgBgAQIVAAAL2IO58c82bA7m5gEA38Cg7P8O2ra/d9+jAT/32BmnhZV0HDpUAAAsENQO\nFQBwd7DZbB1dQocjUAEAptlsDDz5DQAAYAE6VACAaRFi5EugAgBM4xgqI18AACxBhwoAMC2CRUkE\nKgDAPEa+jHwBALAEHSoAwDQbq3wJVACAeRxDZeQLAIAl6FABAKaxKIlABQBYIIJAZeQLAIAV6FAB\nAKbZ6M8IVACAeRxD/QYj30uXLgWzDgAAQlq7AvXatWt68skn5XQ6g10PACAERdhsAV/CRbsC9Q9/\n+IMmTJigysrKYNcDAAhBNhN/wkW7AvX999/XtGnTZBiGGhoagl0TAAAhx2+gfvrpp/r2t7+t6Oho\npaena9OmTbejLgBACImwRQR8CRd+9+Ts2bOaPHmyJOnRRx9lJRcA4H+w2WwBX8KF34/NPPbYY5Kk\npqYm2Ww29evXT//617/UvXv3oBcHAECoaNfnUPPz85WamqrDhw+rpaVF27Zt01tvvRXs2gAAISKc\nVusGql3D63/84x96+umndeLECf3kJz9Rc3NzsOsCAIQQVvm2M1C9Xq8++ugjDRo0SI2NjQQqAAD/\nwe/It66uTi+88II+/PBDvfrqq3r11Vf10ksv3Y7aAAAhIpxW6wbK72/gk08+0d/+9jctXrxYbrdb\nMTExSk1NvQ2lAQBCBat82xGoaWlp2rVrlySpsrJSWVlZQS8KAIBQ4zdQ7Xa7hg8frl27duno0aN6\n6KGHbkddAIAQwrl82/mxmczMTE2cOFFTpkwJdj0AgBAUTqt1A9WuQI2Li1Nubq7Gjh0b7HoAAAhJ\n7f6C8ezs7GDWAQAIYeG0uChQ7Q5UAABuJZyOhQaKDw4BAGABOlQAgGksSiJQAQAW4ExJBCoA4A5m\nGIaKiop0/PhxRUZGqri4WHFxca0e43a7NXXqVM2fP1/x8fGSpPLycu3YsUNer1fZ2dlKT0/XzJkz\ndfHiRRmGofr6eiUmJqq0tFRvv/22tmzZIpvNpuTkZL388suSpJSUFPXv31+SlJiYqPz8/DZrJVAB\nAKYFa5Xv9u3b5fF4VFFRoaNHj6qkpERlZWW++10ulwoLC9XQ0OC7bf/+/Tp8+LAqKip09epVrVq1\nSpK0aNEiSdIXX3yhyZMn67XXXlNdXZ2qqqpUWVkpSZowYYLGjBmjqKgoDR48WMuXL293rfToAADT\ngnWmpJqaGiUnJ0uShg4dKpfL1ep+r9ersrIyDRgwwHfbnj17lJCQoBkzZigvL0+jRo1q9ZwlS5Zo\n4sSJ6tWrl2JjY7Vy5UrffdevX1fnzp3lcrnU0NCgSZMmKTc3V6dOnfL7O6BDBQDcsZqamhQTE+O7\nbrfb1dLSooiIL/vBxMRESV+Ohr9y+fJlnTt3TitWrFBdXZ3y8vK0detWSVJjY6P27dunuXPn+rbn\ncDgkSQsWLNADDzyg++67TxcuXFBubq4ef/xx1dTUaNasWb4u9lYIVACAacFa5RsdHd3qO7hvDtNb\ncTgcGjhwoOx2u+Lj49W5c2c1NjaqZ8+e2rp1q5566qlWI2qPx6M5c+YoJiZGRUVFkqQhQ4bonnvu\nkSQlJSXpwoULfmtl5AsAMC1YI99hw4bJ6XRKko4cOaKEhAS/tSQlJWn37t2SpIaGBl27dk09evSQ\nJO3du1cpKSmtHp+Xl6f7779fRUVFvqBdtmyZ1qxZI0mqra1VbGys39elQwUA3LFGjx6t6upq31eH\nlpSUqKqqSm63W5mZmb7H3dxxpqam6uDBg8rIyJBhGCosLPTdf/r06VarhLdv366DBw/K6/XK6XTK\nZrOpoKBAubm5euWVV+R0OmW321VSUuK3Vptx8+DZYp9t2BysTQMAvqFB2f8dtG1nD38h4Odu2P8b\nCyvpOHSoAADTOJcvx1ABALAEHSoAwDTO5UugAgAswMg3yIGav+B3wdw8OhBfJoxwEMQ1mXekD4K4\nKAl0qAAAC/Amm0AFAFiAkS+rfAEAsAQdKgDANFb5EqgAAAsw8mXkCwCAJehQAQCmscqXQAUAWICR\nLyNfAAAsQYcKADCNkS+BCgCwAB+bYeQLAIAl6FABAKZF0KASqAAA8ziGysgXAABL0KECAEzjc6gE\nKgDAAox8GfkCAGAJOlQAgGkRfA6VQAUAmMfIl5EvAACW+EYdqsfjkSRFRkYGpRgAQGhila+fDrW2\ntlbTp0/X3Llz9ec//1nJyclKTk7We++9d7vqAwCEAJst8Eu4aDNQi4qKlJOTo0ceeUQvvfSSNm7c\nqG3btmnDhg23qz4AAEJCmyPfTp066ZFHHpEkrV27Vv3795ckdenSJeiFAQBCByNfP4F686qtm4+b\ntrS0BK8iAEDI4evb/ATqZ599poKCAhmG0ernEydO3K76AAAICW0G6uLFi30/Z2Vlfe3PAADwOVQ/\ngTp8+PDbVQcAIIRxDJUzJQEALECecqYkAAAsQYcKADCNkS8dKgAAlqBDBQCYxudQCVQAgAUY+RKo\nAAALkKccQwUAwBJ0qAAA0zhTEh0qAACWoEMFAJjGoiQCFQBgAfKUQAUAWIAOlUAFANzBDMNQUVGR\njh8/rsjISBUXFysuLq7VY9xut6ZOnar58+crPj5eklReXq4dO3bI6/UqOztb6enpmjlzpi5evCjD\nMFRfX6/ExESVlpbq3Xff1caNG9WpUydNnz5dqamp+ve//61Zs2bp0qVLio6O1ptvvqkePXq0WSuL\nkgAAptlM/GnL9u3b5fF4VFFRoYKCApWUlLS63+VyaeLEiaqrq/Pdtn//fh0+fFgVFRVat26dzp8/\nL0latGiR1q5dq7feekvdu3fXa6+9posXL2rdunXauHGjVq5cqdLSUnm9Xr3zzjtKSEjQ+vXr9fTT\nT6usrMzv74BABQDcsWpqapScnCxJGjp0qFwuV6v7vV6vysrKNGDAAN9te/bsUUJCgmbMmKG8vDyN\nGjWq1XOWLFmiiRMnqlevXjp27JiSkpJkt9sVHR2t/v37q7a2VjU1NUpJSZEkpaSkaO/evX5rZeQL\nADAtWJ9DbWpqUkxMjO+63W5XS0uLIiK+7AcTExMlfTka/srly5d17tw5rVixQnV1dcrLy9PWrVsl\nSY2Njdq3b5/mzp37tdvv0qWLmpqa1NzcrOjoaElS165d1dTU5LdWAhUAYFpEkNYkRUdHq7m52Xf9\n5jC9FYfDoYEDB8putys+Pl6dO3dWY2Ojevbsqa1bt+qpp57yvQGIjo5uFZbNzc3q1q1bq9dtbm5u\nFbq3wsgXAGCazWYL+NKWYcOGyel0SpKOHDmihIQEv7UkJSVp9+7dkqSGhgZdu3bNt6Bo7969vlGu\nJH3ve99TTU2NPB6Prly5opMnT+o73/mOEhMTfa/rdDr14IMP+n1dOlQAwB1r9OjRqq6uVlZWliSp\npKREVVVVcrvdyszM9D3u5mBOTU3VwYMHlZGRIcMwVFhY6Lv/9OnTrVYJ33vvvcrJyVF2drYMw9DM\nmTMVGRmpCRMmaPbs2crOzlZkZKRKS0v91mozbh48W+y/hv4wWJtGB+O8nQgHQfzn7470wdH1Qdv2\nLzN+FvBz8yvnWVhJxwlqh3rmn38P5ubRge6mQL3b/tHlvy0CEaxjqKGEY6gAAFiAY6gAANPupsnG\nrRCoAADTyFNGvgAAWIIOFQBgGt82Q6ACACzg7yT3dwNGvgAAWIAOFQBgGhNfAhUAYAGOoTLyBQDA\nEnSoAADTOLEDgQoAsAB5ysgXAABL0KECAExj5EugAgAswNe3MfIFAMASdKgAANMY+RKoAAALkKeM\nfAEAsAQdKgDANE49SKACACzAMVRGvgAAWCKgQD1z5ozVdQAAQpjNFvglXAQUqAUFBVbXAQAIYTab\nLeBLuAgoUA3DsLoOAABCWkCLksLpHQUAwDxiwU+gzpw583+Ep2EYqqurC2pRAIDQwsdm/ARqVlbW\nN7odAIC7VZuBOnz48NtVBwAghNGgcmIHAIAFWFvDiR0AALAEHSoAwDQaVAIVAGABRr6MfAEAsAQd\nKgDANBpUAhUAYAFO7MDIFwAAS9ChAgBMo0ElUAEAFmCVLyNfAAAsQYcKADCNBpVABQBYgJEvI18A\nACxBhwoAMI0GlUAFAFiAkS8jXwAALEGHCgAwLVgNqmEYKioq0vHjxxUZGani4mLFxcW1eozb7dbU\nqVM1f/58xcfHS5LKy8u1Y8cOeb1eZWdnKz09XY2NjZo3b56uXLmiGzduaMGCBWpublZxcbFsNpsM\nw9DRo0dVVlamkSNHKiUlRf3795ckJSYmKj8/v81aCVQAgGnBGvlu375dHo9HFRUVOnr0qEpKSlRW\nVua73+VyqbCwUA0NDb7b9u/fr8OHD6uiokJXr17VqlWrJEkLFy7UuHHj9MQTT2jfvn06efKkHn30\nUa1bt06StHXrVvXt21cjR47U2bNnNXjwYC1fvrzdtTLyBQDcsWpqapScnCxJGjp0qFwuV6v7vV6v\nysrKNGDAAN9te/bsUUJCgmbMmKG8vDyNGjVKknTo0CF9/vnnmjJliqqqqjRixAjfc9xut5YuXap5\n8+ZJ+jKoGxoaNGnSJOXm5urUqVN+ayVQAQCm2WyBX9rS1NSkmJgY33W73a6Wlhbf9cTERPXp00eG\nYfhuu3z5slwul5YsWaKioiIVFBRIkurr6+VwOLR69Wr17dtX5eXlvudUVlbq+9//vrp37y5J6t27\nt3Jzc7V27Vq9+OKLmjVrlt/fQVBHvqv/7/PB3DwA4A4RrK9vi46OVnNzs+96S0uLIiLa7gUdDocG\nDhwou92u+Ph4RUVFqbGxUQ6Hw9etpqWlafHixb7nfPDBB1q6dKnv+pAhQ3TPPfdIkpKSknThwgW/\ntdKhAgBMC1aHOmzYMDmdTknSkSNHlJCQ4LeWpKQk7d69W5LU0NAgt9utHj16KCkpSTt37pQkHThw\nQIMGDZL0ZRfs9XrVp08f3zaWLVumNWvWSJJqa2sVGxvr93VZlAQAuGONHj1a1dXVysrKkiSVlJSo\nqqpKbrdbmZmZvsfdvCgqNTVVBw8eVEZGhgzDUGFhoWw2m2bPnq158+apoqJCMTExKi0tlSSdOnVK\n/fr1a/W6X415nU6n7Ha7SkpK/NZqM24ePFusZtHaYG0aAPANJc2cFLRtb3/11wE/97E3p1tYScdh\n5AsAgAUY+QIATOPMgwQqAMACtggSlUAFAJhGh8oxVAAALEGHCgAwja9vo0MFAMASdKgAANNoUAlU\nAIAFGPkSqAAAC5CnHEMFAMASdKgAAPNoUelQAQCwAh0qAMA0FiURqAAAC5CnBCoAwAKcHJ9jqAAA\nWIIOFQBgGiPfb9ihXr58WeXl5cGqBQCAkNWuDvXYsWNav3699uzZo8cffzzYNQEAQgyrfNsIVI/H\now8//FDr169XZGSkmpqa9Mc//lFRUVG3sz4AQAggT9sY+aalpen48eP6xS9+oQ0bNqh3796EKQDg\na9lstoAv4eKWHerkyZP1wQcfqL6+XhkZGTIM43bWBQBASLllh/rCCy/o/fffV05OjqqqquRyubRw\n4UJ9+umnt7M+AEAIsNkCv4QLv6t8hw8froULF2rbtm3q27evfvzjH9+OugAACCnt/hxqt27dlJOT\no5ycnGDWAwAIQeF0LDRQnNgBAGAe590jUAEA5tGh8p4CAABL0KECAEyjQaVDBQDAEnSoAADTOIZK\noAIALECeEqgAACuQqBxDBQDACnSoAADTbBF0qHSoAABYgA4VAGAah1AJVACABfjYDIEKALAAecox\nVAAALEGHCgAwjxaVDhUAACvQoQIATONzqAQqAMACTHwJVACAFUhUAhUAcOcyDENFRUU6fvy4IiMj\nVVxcrLi4uFaPcbvdmjp1qubPn6/4+HhJUnl5uXbs2CGv16vs7Gylp6ersbFR8+bN05UrV3Tjxg0t\nWLBAcXFxKi4u1qFDh9S1a1dJUllZmTp16qRZs2bp0qVLio6O1ptvvqkePXq0WWtQA/V/TxsbzM2j\nIxlGR1dw+/DOG/ArWP+bbN++XR6PRxUVFTp69KhKSkpUVlbmu9/lcqmwsFANDQ2+2/bv36/Dhw+r\noqJCV69e1apVqyRJCxcu1Lhx4/TEE09o3759OnnypOLi4vTxxx/rt7/9rRwOh28bb7/9thISEvTy\nyy9ry5YtKisr09y5c9uslVW+AIA7Vk1NjZKTkyVJQ4cOlcvlanW/1+tVWVmZBgwY4Lttz549SkhI\n0IwZM5SXl6dRo0ZJkg4dOqTPP/9cU6ZMUVVVlUaMGCHDMHTmzBm9/vrrmjBhgn73u9/5XjclJUWS\nlJKSor179/qtlZEvAMC0YK3ybWpqUkxMjO+63W5XS0uLIiK+7AcTExMlfTka/srly5d17tw5rVix\nQnV1dcrLy9PWrVtVX18vh8Oh1atX66233lJ5ebmef/555eTkaMqUKbp+/bomT56sIUOGqKmpSdHR\n0ZKkrl27qqmpyW+tdKgAANNsNlvAl7ZER0erubnZd/3mML0Vh8Oh5ORk2e12xcfHKyoqSo2NjXI4\nHL5uNS0tTR9//LG6dOminJwcde7cWV27dtWIESNUW1urmJgY3+s2Nze3CvVbIVABAObZTFzaMGzY\nMDmdTknSkSNHlJCQ4LeUpKQk7d69W5LU0NAgt9utHj16KCkpSTt37pQkHThwQIMGDdLJkyc1YcIE\nGYYhr9ermpoaDRkypNXrOp1OPfjgg35fl5EvAOCONXr0aFVXVysrK0uSVFJSoqqqKrndbmVmZvoe\nd3Onm5qaqoMHDyojI0OGYaiwsFA2m02zZ8/WvHnzVFFRoZiYGJWWliomJkbjx49XZmamOnXqpB/8\n4AcaOHCg+vXrp9mzZys7O1uRkZEqLS31W6vNMIK3XNPzxaVgbRodjVW+QMiJ7NYraNv+dG1lwM9N\nmJRhYSUdh5EvAAAWYOQLADCNLxgnUAEAVmDeSaACAMyjQ+U9BQAAlqBDBQCYRodKhwoAgCXoUAEA\n5tGgEqgAAPOCdXL8UEKgAgDM4xgqx1ABALACHSoAwDQaVDpUAAAsQYcKADCNz6ESqAAAK7DKl0AF\nAJhHh+onUN97771b3jd+/HjLiwEAIFS1GagnTpzw/fzhhx9q7NixkngnAgD4D8RC24FaUFDg+/nI\nkSOtrgMAgP+v3cdQ6UoBALdCRrAoCQBgAc7l6ydQZ86cKZvNJsMw9Nlnn7Ua+ZaWlga9OABAiKBD\nbTtQs7KyvvZnAABuxsjXT6AOHz78dtUBAEBI41y+AABYgEVJAADzmPgSqAAA81jlS6ACAKzAoiQC\nFQBgHqt8WZQEAIAlCFQAACzAyBcAYB6LkghUAIB5HEMlUAEAViBPCVQAgHl0qCxKAgDAEgQqAAAW\nYOQLADCPVb4EKgDAPI6hEqgAACsQqAQqAMA8OlQWJQEAYAkCFQAACzDyBQCYxyrf4Aaq98q/grl5\ndCTD6OgKAHxDkd16BW3bHEOlQwUAWIFAJVABAObZGPkSqACAO5dhGCoqKtLx48cVGRmp4uJixcXF\ntXqM2+3W1KlTNX/+fMXHx0uSysvLtWPHDnm9XmVnZys9PV2NjY2aN2+erly5ohs3bmjBggWKi4vT\n22+/rS1btshmsyk5OVkvv/yyJCklJUX9+/eXJCUmJio/P7/NWglUAMAda/v27fJ4PKqoqNDRo0dV\nUlKisrIy3/0ul0uFhYVqaGjw3bZ//34dPnxYFRUVunr1qlatWiVJWrhwocaNG6cnnnhC+/bt08mT\nJyVJVVVVqqyslCRNmDBBY8aMUVRUlAYPHqzly5e3u1Y+NgMAMM9mC/zShpqaGiUnJ0uShg4dKpfL\n1ep+r9ersrIyDRgwwHfbnj17lJCQoBkzZigvL0+jRo2SJB06dEiff/65pkyZoqqqKo0YMUKxsbFa\nuXKl77nXr19X586d5XK51NDQoEmTJik3N1enTp3y+ysgUAEAptlstoAvbWlqalJMTIzvut1uV0tL\ni+96YmKi+vTpI+OmTx5cvnxZLpdLS5YsUVFRkQoKCiRJ9fX1cjgcWr16tfr27avy8nLZ7XY5HA5J\n0oIFC/TAAw/ovvvuU+/evZWbm6u1a9fqxRdf1KxZs/z+Dhj5AgDMC9Iq3+joaDU3N/uut7S0KCKi\n7V7Q4XBo4MCBstvtio+PV1RUlBobG+VwOHzdalpamhYvXixJ8ng8mjNnjmJiYlRUVCRJGjJkiO65\n5x5JUlJSki5cuOC3VjpUAIBptghbwJe2DBs2TE6nU5J05MgRJSQk+K0lKSlJu3fvliQ1NDTI7Xar\nR48eSkpK0s6dOyVJBw4c0KBBgyRJeXl5uv/++1VUVOTrmJctW6Y1a9ZIkmpraxUbG+v3delQAQB3\nrNGjR6u6ulpZWVmSpJKSElVVVcntdiszM9P3uJtHx6mpqTp48KAyMjJkGIYKCwtls9k0e/ZszZs3\nT++88466deum0tJSbd++XQcPHpTX65XT6ZTNZlNBQYFyc3P1yiuvyOl0ym63q6SkxG+tNsMI3ilv\nmutPBmvT6GicKQkIOV3/18CgbbvxyP6An9vz/wy3sJKOQ4cKADCPMyURqAAACxCoBCoAwDxOjk+g\nAgCswLl8+dgMAABWIFABALAAI18AgGk2G/1Zu34DhmHo2WefVW1tbbDrAQCEoiCdHD+UtCtQq6ur\n1atXL73zzjvBrgcAEIKCdXL8UNKuQK2srNTcuXN1+vRpXb16Ndg1AQBCTYQt8EuY8BuoFy9eVHNz\ns/r166exY8eqqqrqdtQFAEBI8Ruof/rTn3wnIB47dqz+8pe/BL0oAABCzTc+Of758+fb9TU2EifH\nD2ucHB8IOcE8Of4Xf3MF/Nxu3xliYSUdp10fm1m5cqW6deumL774Qps3b1ZycrLmzJkT7NoAAKEi\njBYXBapdi5I++ugjjR8/Xrt27dKWLVv0ySefBLsuAEAosUUEfgkT7epQIyIidPHiRd17772SpGvX\nrgW1KABAaLGF0WrdQLXrrcGIESOUk5OjiRMnav78+Xr00UeDXRcAACGlzUVJmzdvVmpqqnr27Om7\nzePxKDIysl0bZ1FSGGNREhBygrko6cqpwM+kFxP/XQsr6ThtjnxjYmK0aNEiNTY2asiQIRo1apTu\nv//+21UbACBUsCjJ/8dmzp49K4fDodOnT2vnzp1yuVyKjY3VG2+84XfjdKhhjA4VCDnB7FCbznwa\n8HOj70uwsJKO43dRUm1trT777DPNmDFDMTExOnPmjF566aXbURsAIFSE0WrdQPn9DaSlpWnXrl2S\nvjynb1ZWlnr37h30wgAAocMWYQv4Ei78Bqrdbtfw4cO1a9cuHT16VA899NDtqAsAgJDSrs+hZmZm\nauLEiZoyZUqw6wEAICS1K1Dj4uKUm5ursWPHBrseAEAoYpVv+wJVkrKzs4NZBwAghIXTF4UHqt2B\nCgDALbHKl0AFAFggjFbrBoq3FAAAWIBABQDAAox8AQCmsSiJQAUAWIFFSQQqAMA8OlQCFQBgBTpU\nFiUBAGAFAhUAAAsw8gUAmBZOX8MWKAIVAGAei5IIVACAeTYWJRGoAAAL0KHKZhiG0dFFAAAQ6ujR\nAQCwAIEKAIAFCFQAACxAoAIAYAECFQAACxCoAABYIOwCddmyZXr88cc1adIk5eTkaNy4cVqxYkVH\nlxUUd9O+/v73v9eiRYskSWlpafJ4PL77Tp48qZycnI4qzXLTp0/X9OnTW90W7vu8bNkyZWZmqqWl\nxXfbs88+q3PnznVgVdbatm2bNm/e3NFlIIjCLlAlaerUqVq7dq3WrVunyspKVVZWqrGxsaPLCoq7\naV+/8nXfuxgu38V4/vx5ud1uNTU16e9//7vv9nDe56+cO3dOv/71r33Xw2H/rl+/ro8++khnz56V\n0+lUamrq1z5u06ZNunjx4u0tDpYL6TMlOZ1OrVy5Ul27dtWECRPUqVMnVVdXa/z48b7HXL58WTdu\n3FBUVFQHVmre3b6v169f993/n+ciCcVzk3zdPkrSoUOH9NhjjykqKkrr16/X7Nmzv/b5objPX/nP\nfbfb7aqurta0adO0adMmpaWl6bvf/W5I72NdXZ02bdqk/fv3KyUlRQ8//LAaGxvVs2dP/fKXv9S+\nffvU0tKiMWPGaNq0aerevbvmzJmjrl27KiMjQ4888khYvKG424R0oBqGoTVr1ujYsWNatGiROnXq\npISEBK1atUpVVVU6f/68+vTpo+LiYnXp0qWjyzXlbt/XESNGtHrM888/7/sHx+1261vf+lZHlBqw\nr9vH+fPn64033tCmTZsUERGhp556Svn5+YqMjJRhGCG/z1+51d/lLl266Cc/+YleffVVvfvuux1d\nZsA2bNigFStW6Kc//almzpwpSTp27JiGDBkiSaqqqtK6det077336r333pMkjRkzRmPGjFF9fb2K\ni4u1dOlSbdy4scP2AQEywszSpUuNiooKwzAMw+VyGWPGjDHq6+s7uKrguJv2dfPmzUZpaalhGIYx\natQow+Px+O47ceKEkZOT01GlWcbpdBojR440pk2bZjz//PNGSkqKUVlZaRhG+O7zV27+u/zmm28a\nCxcuNJ599tmQ/Pv8z3/+01i9erXxwx/+0Pj5z39unDx50vjVr35l/PWvfzUMwzBqamqM/Px8Iycn\nx/jNb35jGIZhNDU1GRs3bjQmTZpkvP7668Ynn3zSkbuAAIV0h+rP4MGD9cILLyg/Pz/s3+3dTfsq\nhfbI81ZGJttjAAABDElEQVQqKytVXFyslJQUSV+Of3/2s58pPT1dUnju89fJz89Xenp6yB5T7N69\nu5577jk999xzOnDggJYtW6YrV67oRz/6kTwej7Zu3epbYPfkk09q7NixWrhwoR5++GEtX7485CdM\nd7OwXJR0s4yMDEVHR2vDhg0dXUrQ3S37Go7Hli5duqRjx45p5MiRvtuGDRsmj8ejw4cPh+U+30pk\nZKRKSkp05cqVji7FtIceekizZ89WbGyspC/3rXv37nrmmWc0adIkJScnKzY2VosWLdIzzzxDmIY4\nvm0GAAALhH2HCgDA7UCgAgBgAQIVAAALEKgAAFiAQAUAwAIEKgAAFiBQAQCwAIEKAIAF/h96ZA5A\nt34OdAAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x1fab9428c18>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "= R R IH AH N  \n",
      "> quit\n"
     ]
    }
   ],
   "source": [
    "interactive_session(model, vocab, i2w, show_attention=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how the attention weights show how important different parts of the input are for generating different tokens in the output. For tasks like machine translation, where the order of one-to-one words often changes due to grammatical differences between languages, this becomes very interesting as we see the attention window move further away from the diagonal that is mostly displayed in grapheme-to-phoneme translations.\n",
    "\n",
    "**What's next**\n",
    "\n",
    "With the above model, you have the basics for training a powerful sequence-to-sequence model with attention in a number of distinct domains. The only major changes required are preparing a dataset with pairs input and output sequences and in general the rest of the building blocks will remain the same. Good luck, and have fun!"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}