Repository: barronalex/Dynamic-Memory-Networks-in-TensorFlow
Branch: master
Commit: 6b35d5b397f7
Files: 9
Total size: 38.8 KB

Directory structure:
gitextract_74facvp9/
├── .gitignore
├── LICENSE.txt
├── README.md
├── attention_gru_cell.py
├── babi_input.py
├── dmn_plus.py
├── dmn_test.py
├── dmn_train.py
└── fetch_babi_data.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================

/data
/papers
/weights
/summaries
*.swp
*.pyc
*.zip
*.xlsx
*.gz
dmn_original.py

================================================
FILE: LICENSE.txt
================================================

The MIT License (MIT)

Copyright (c) 2016 Alex Barron

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
================================================
FILE: README.md
================================================

# Dynamic Memory Networks in TensorFlow

DMN+ implementation in TensorFlow for question answering on the bAbI 10k dataset.

Structure and parameters from [Dynamic Memory Networks for Visual and Textual Question Answering](https://arxiv.org/abs/1603.01417), henceforth referred to as Xiong et al.

Adapted from Stanford's [cs224d](http://cs224d.stanford.edu/) assignment 2 starter code, using methods from [Dynamic Memory Networks in Theano](https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano) for importing the bAbI-10k dataset.

## Repository Contents

| file | description |
| --- | --- |
| `dmn_plus.py` | contains the DMN+ model |
| `dmn_train.py` | trains the model on a specified (-b) bAbI task |
| `dmn_test.py` | tests the model on a specified (-b) bAbI task |
| `babi_input.py` | prepares bAbI data for input into the DMN |
| `attention_gru_cell.py` | contains a custom Attention GRU cell implementation |
| `fetch_babi_data.sh` | shell script to fetch bAbI tasks (from [DMNs in Theano](https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano)) |

## Usage

Install [TensorFlow r1.4](https://www.tensorflow.org/install/).

Run the included shell script to fetch the data:

    bash fetch_babi_data.sh

Use `dmn_train.py` to train the DMN+ model contained in `dmn_plus.py`:

    python dmn_train.py --babi_task_id 2

Once training is finished, test the model on a specified task:

    python dmn_test.py --babi_task_id 2

The l2 regularization constant can be set with `--l2_loss` (`-l`). All other parameters were specified by [Xiong et al](https://arxiv.org/abs/1603.01417) and can be found in the `Config` class in `dmn_plus.py`.

## Benchmarks

The TensorFlow DMN+ reaches close to state-of-the-art performance on the 10k dataset with weak supervision (no supporting facts). Each task was trained separately with l2 = 0.001.

As the paper suggests, 10 training runs were used for tasks 2, 3, 17 and 18 (configurable with --num-runs), with the weights that produce the lowest validation loss in any run used for testing. The pre-trained weights which achieve these benchmarks are available in 'pretrained'.

I haven't yet had the time to fully optimize the l2 parameter, which is not specified by the paper. My hypothesis is that fully optimizing l2 regularization would close the final significant performance gap between the TensorFlow DMN+ and the original DMN+ on task 3.

Below are the full test error rates (%) for each bAbI task (tasks where both implementations achieved 0 test error are omitted):

| Task ID | TensorFlow DMN+ | Xiong et al DMN+ |
| :---: | :---: | :---: |
| 2 | 0.9 | 0.3 |
| 3 | 18.4 | 1.1 |
| 5 | 0.5 | 0.5 |
| 7 | 2.8 | 2.4 |
| 8 | 0.5 | 0.0 |
| 9 | 0.1 | 0.0 |
| 14 | 0.0 | 0.2 |
| 16 | 46.2 | 45.3 |
| 17 | 5.0 | 4.2 |
| 18 | 2.2 | 2.1 |

================================================
FILE: attention_gru_cell.py
================================================

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import math

from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import clip_ops
from tensorflow.python.ops import embedding_ops
from tensorflow.python.ops import init_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import nn_ops
from tensorflow.python.ops import partitioned_variables
from tensorflow.python.ops import variable_scope as vs
from tensorflow.python.ops.math_ops import sigmoid
from tensorflow.python.ops.math_ops import tanh
from tensorflow.python.ops.rnn_cell_impl import RNNCell
from tensorflow.python.platform import tf_logging as logging
from tensorflow.python.util import nest


class AttentionGRUCell(RNNCell):
    """Gated Recurrent Unit incorporating attention (cf. https://arxiv.org/abs/1603.01417).
    Adapted from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py

    NOTE: Takes an input of shape: (batch_size, max_time_step, input_dim + 1)
    Where an input vector of shape: (batch_size, max_time_step, input_dim)
    and scalar attention of shape: (batch_size, max_time_step, 1)
    are concatenated along the final axis"""

    def __init__(self, num_units, input_size=None, activation=tanh):
        if input_size is not None:
            logging.warn("%s: The input_size parameter is deprecated.", self)
        self._num_units = num_units
        self._activation = activation

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def __call__(self, inputs, state, scope=None):
        """Attention GRU with num_units cells."""
        with vs.variable_scope(scope or "attention_gru_cell"):
            with vs.variable_scope("gates"):
                # Reset gate and update gate.
                # We start with bias of 1.0 to not reset and not update.
                if inputs.get_shape()[-1] != self._num_units + 1:
                    raise ValueError("Input should be passed as word input concatenated with 1D attention on end axis")
                # extract input vector and attention
                inputs, g = array_ops.split(inputs,
                                            num_or_size_splits=[self._num_units, 1],
                                            axis=1)
                r = _linear([inputs, state], self._num_units, True)
                r = sigmoid(r)
            with vs.variable_scope("candidate"):
                r = r*_linear(state, self._num_units, False)
            with vs.variable_scope("input"):
                x = _linear(inputs, self._num_units, True)
            h_hat = self._activation(r + x)
            new_h = (1 - g) * state + g * h_hat
        return new_h, new_h


def _linear(args, output_size, bias, bias_start=0.0):
    """Linear map: sum_i(args[i] * W[i]), where W[i] is a variable.

    Args:
        args: a 2D Tensor or a list of 2D, batch x n, Tensors.
        output_size: int, second dimension of W[i].
        bias: boolean, whether to add a bias term or not.
        bias_start: starting value to initialize the bias; 0 by default.

    Returns:
        A 2D Tensor with shape [batch x output_size] equal to
        sum_i(args[i] * W[i]), where W[i]s are newly created matrices.

    Raises:
        ValueError: if some of the arguments has unspecified or wrong shape.
    """
    if args is None or (nest.is_sequence(args) and not args):
        raise ValueError("`args` must be specified")
    if not nest.is_sequence(args):
        args = [args]

    # Calculate the total size of arguments on dimension 1.
    total_arg_size = 0
    shapes = [a.get_shape() for a in args]
    for shape in shapes:
        if shape.ndims != 2:
            raise ValueError("linear is expecting 2D arguments: %s" % shapes)
        if shape[1].value is None:
            raise ValueError("linear expects shape[1] to be provided for shape %s, "
                             "but saw %s" % (shape, shape[1]))
        else:
            total_arg_size += shape[1].value

    dtype = [a.dtype for a in args][0]

    # Now the computation.
    scope = vs.get_variable_scope()
    with vs.variable_scope(scope) as outer_scope:
        weights = vs.get_variable(
            "weights", [total_arg_size, output_size], dtype=dtype)
        if len(args) == 1:
            res = math_ops.matmul(args[0], weights)
        else:
            res = math_ops.matmul(array_ops.concat(args, 1), weights)
        if not bias:
            return res
        with vs.variable_scope(outer_scope) as inner_scope:
            inner_scope.set_partitioner(None)
            biases = vs.get_variable(
                "biases", [output_size],
                dtype=dtype,
                initializer=init_ops.constant_initializer(bias_start, dtype=dtype))
    return nn_ops.bias_add(res, biases)


================================================
FILE: babi_input.py
================================================

from __future__ import division
from __future__ import print_function

import sys
import os
import numpy as np

# can be sentence or word
input_mask_mode = "sentence"


# adapted from https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano/
def init_babi(fname):
    print("==> Loading data from %s" % fname)

    tasks = []
    task = None
    for i, line in enumerate(open(fname)):
        id = int(line[0:line.find(' ')])
        if id == 1:
            task = {"C": "", "Q": "", "A": "", "S": ""}
            counter = 0
            id_map = {}

        line = line.strip()
        line = line.replace('.', ' . ')
        line = line[line.find(' ')+1:]
        # if not a question
        if line.find('?') == -1:
            task["C"] += line
            id_map[id] = counter
            counter += 1
        else:
            idx = line.find('?')
            tmp = line[idx+1:].split('\t')
            task["Q"] = line[:idx]
            task["A"] = tmp[1].strip()
            task["S"] = []
            for num in tmp[2].split():
                task["S"].append(id_map[int(num.strip())])
            tasks.append(task.copy())
    return tasks


def get_babi_raw(id, test_id):
    babi_map = {
        "1": "qa1_single-supporting-fact",
        "2": "qa2_two-supporting-facts",
        "3": "qa3_three-supporting-facts",
        "4": "qa4_two-arg-relations",
        "5": "qa5_three-arg-relations",
        "6": "qa6_yes-no-questions",
        "7": "qa7_counting",
        "8": "qa8_lists-sets",
        "9": "qa9_simple-negation",
        "10": "qa10_indefinite-knowledge",
        "11": "qa11_basic-coreference",
        "12": "qa12_conjunction",
        "13": "qa13_compound-coreference",
        "14": "qa14_time-reasoning",
        "15": "qa15_basic-deduction",
        "16": "qa16_basic-induction",
        "17": "qa17_positional-reasoning",
        "18": "qa18_size-reasoning",
        "19": "qa19_path-finding",
        "20": "qa20_agents-motivations",
        "MCTest": "MCTest",
        "19changed": "19changed",
        "joint": "all_shuffled",
        "sh1": "../shuffled/qa1_single-supporting-fact",
        "sh2": "../shuffled/qa2_two-supporting-facts",
        "sh3": "../shuffled/qa3_three-supporting-facts",
        "sh4": "../shuffled/qa4_two-arg-relations",
        "sh5": "../shuffled/qa5_three-arg-relations",
        "sh6": "../shuffled/qa6_yes-no-questions",
        "sh7": "../shuffled/qa7_counting",
        "sh8": "../shuffled/qa8_lists-sets",
        "sh9": "../shuffled/qa9_simple-negation",
        "sh10": "../shuffled/qa10_indefinite-knowledge",
        "sh11": "../shuffled/qa11_basic-coreference",
        "sh12": "../shuffled/qa12_conjunction",
        "sh13": "../shuffled/qa13_compound-coreference",
        "sh14": "../shuffled/qa14_time-reasoning",
        "sh15": "../shuffled/qa15_basic-deduction",
        "sh16": "../shuffled/qa16_basic-induction",
        "sh17": "../shuffled/qa17_positional-reasoning",
        "sh18": "../shuffled/qa18_size-reasoning",
        "sh19": "../shuffled/qa19_path-finding",
        "sh20": "../shuffled/qa20_agents-motivations",
    }
    if (test_id == ""):
        test_id = id
    babi_name = babi_map[id]
    babi_test_name = babi_map[test_id]
    babi_train_raw = init_babi(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'data/en-10k/%s_train.txt' % babi_name))
    babi_test_raw = init_babi(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'data/en-10k/%s_test.txt' % babi_test_name))
    return babi_train_raw, babi_test_raw


def load_glove(dim):
    word2vec = {}

    print("==> loading glove")
    with open(("./data/glove/glove.6B/glove.6B." + str(dim) + "d.txt")) as f:
        for line in f:
            l = line.split()
            # wrap in list() so the vector is materialized under Python 3,
            # where map() returns a lazy iterator
            word2vec[l[0]] = list(map(float, l[1:]))

    print("==> glove is loaded")

    return word2vec


def create_vector(word, word2vec, word_vector_size, silent=True):
    # if the word is missing from Glove, create some fake vector and store in glove!
    vector = np.random.uniform(0.0, 1.0, (word_vector_size,))
    word2vec[word] = vector
    if (not silent):
        print("utils.py::create_vector => %s is missing" % word)
    return vector


def process_word(word, word2vec, vocab, ivocab, word_vector_size, to_return="word2vec", silent=True):
    if not word in word2vec:
        create_vector(word, word2vec, word_vector_size, silent)
    if not word in vocab:
        next_index = len(vocab)
        vocab[word] = next_index
        ivocab[next_index] = word

    if to_return == "word2vec":
        return word2vec[word]
    elif to_return == "index":
        return vocab[word]
    elif to_return == "onehot":
        raise Exception("to_return = 'onehot' is not implemented yet")


def process_input(data_raw, floatX, word2vec, vocab, ivocab, embed_size, split_sentences=False):
    questions = []
    inputs = []
    answers = []
    input_masks = []
    for x in data_raw:
        if split_sentences:
            inp = x["C"].lower().split(' . ')
            inp = [w for w in inp if len(w) > 0]
            inp = [i.split() for i in inp]
        else:
            inp = x["C"].lower().split(' ')
            inp = [w for w in inp if len(w) > 0]

        q = x["Q"].lower().split(' ')
        q = [w for w in q if len(w) > 0]

        if split_sentences:
            inp_vector = [[process_word(word=w,
                                        word2vec=word2vec,
                                        vocab=vocab,
                                        ivocab=ivocab,
                                        word_vector_size=embed_size,
                                        to_return="index") for w in s] for s in inp]
        else:
            inp_vector = [process_word(word=w,
                                       word2vec=word2vec,
                                       vocab=vocab,
                                       ivocab=ivocab,
                                       word_vector_size=embed_size,
                                       to_return="index") for w in inp]

        q_vector = [process_word(word=w,
                                 word2vec=word2vec,
                                 vocab=vocab,
                                 ivocab=ivocab,
                                 word_vector_size=embed_size,
                                 to_return="index") for w in q]

        if split_sentences:
            inputs.append(inp_vector)
        else:
            inputs.append(np.vstack(inp_vector).astype(floatX))

        questions.append(np.vstack(q_vector).astype(floatX))
        answers.append(process_word(word=x["A"],
                                    word2vec=word2vec,
                                    vocab=vocab,
                                    ivocab=ivocab,
                                    word_vector_size=embed_size,
                                    to_return="index"))
        # NOTE: here we assume the answer is one word!
        if not split_sentences:
            if input_mask_mode == 'word':
                input_masks.append(np.array([index for index, w in enumerate(inp)], dtype=np.int32))
            elif input_mask_mode == 'sentence':
                input_masks.append(np.array([index for index, w in enumerate(inp) if w == '.'], dtype=np.int32))
            else:
                raise Exception("invalid input_mask_mode")

    return inputs, questions, answers, input_masks


def get_lens(inputs, split_sentences=False):
    lens = np.zeros((len(inputs)), dtype=int)
    for i, t in enumerate(inputs):
        lens[i] = t.shape[0]
    return lens


def get_sentence_lens(inputs):
    lens = np.zeros((len(inputs)), dtype=int)
    sen_lens = []
    max_sen_lens = []
    for i, t in enumerate(inputs):
        sentence_lens = np.zeros((len(t)), dtype=int)
        for j, s in enumerate(t):
            sentence_lens[j] = len(s)
        lens[i] = len(t)
        sen_lens.append(sentence_lens)
        max_sen_lens.append(np.max(sentence_lens))
    return lens, sen_lens, max(max_sen_lens)


def pad_inputs(inputs, lens, max_len, mode="", sen_lens=None, max_sen_len=None):
    if mode == "mask":
        padded = [np.pad(inp, (0, max_len - lens[i]), 'constant', constant_values=0) for i, inp in enumerate(inputs)]
        return np.vstack(padded)

    elif mode == "split_sentences":
        padded = np.zeros((len(inputs), max_len, max_sen_len))
        for i, inp in enumerate(inputs):
            padded_sentences = [np.pad(s, (0, max_sen_len - sen_lens[i][j]), 'constant', constant_values=0) for j, s in enumerate(inp)]
            # trim array according to max allowed inputs
            if len(padded_sentences) > max_len:
                padded_sentences = padded_sentences[(len(padded_sentences)-max_len):]
                lens[i] = max_len
            padded_sentences = np.vstack(padded_sentences)
            padded_sentences = np.pad(padded_sentences, ((0, max_len - lens[i]), (0, 0)), 'constant', constant_values=0)
            padded[i] = padded_sentences
        return padded

    padded = [np.pad(np.squeeze(inp, axis=1), (0, max_len - lens[i]), 'constant', constant_values=0) for i, inp in enumerate(inputs)]
    return np.vstack(padded)


def create_embedding(word2vec, ivocab, embed_size):
    embedding = np.zeros((len(ivocab), embed_size))
    for i in range(len(ivocab)):
        word = ivocab[i]
        embedding[i] = word2vec[word]
    return embedding


def load_babi(config, split_sentences=False):
    vocab = {}
    ivocab = {}

    babi_train_raw, babi_test_raw = get_babi_raw(config.babi_id, config.babi_test_id)

    if config.word2vec_init:
        assert config.embed_size == 100
        word2vec = load_glove(config.embed_size)
    else:
        word2vec = {}

    # set word at index zero to be end of sentence token so padding with zeros is consistent
    process_word(word="",
                 word2vec=word2vec,
                 vocab=vocab,
                 ivocab=ivocab,
                 word_vector_size=config.embed_size,
                 to_return="index")

    print('==> get train inputs')
    train_data = process_input(babi_train_raw, config.floatX, word2vec, vocab, ivocab, config.embed_size, split_sentences)
    print('==> get test inputs')
    test_data = process_input(babi_test_raw, config.floatX, word2vec, vocab, ivocab, config.embed_size, split_sentences)

    if config.word2vec_init:
        assert config.embed_size == 100
        word_embedding = create_embedding(word2vec, ivocab, config.embed_size)
    else:
        word_embedding = np.random.uniform(-config.embedding_init, config.embedding_init, (len(ivocab), config.embed_size))

    inputs, questions, answers, input_masks = train_data if config.train_mode else test_data

    if split_sentences:
        input_lens, sen_lens, max_sen_len = get_sentence_lens(inputs)
        max_mask_len = max_sen_len
    else:
        input_lens = get_lens(inputs)
        mask_lens = get_lens(input_masks)
        max_mask_len = np.max(mask_lens)

    q_lens = get_lens(questions)
    max_q_len = np.max(q_lens)

    max_input_len = min(np.max(input_lens), config.max_allowed_inputs)

    # pad out arrays to max
    if split_sentences:
        inputs = pad_inputs(inputs, input_lens, max_input_len, "split_sentences", sen_lens, max_sen_len)
        input_masks = np.zeros(len(inputs))
    else:
        inputs = pad_inputs(inputs, input_lens, max_input_len)
        input_masks = pad_inputs(input_masks, mask_lens, max_mask_len, "mask")

    questions = pad_inputs(questions, q_lens, max_q_len)

    answers = np.stack(answers)

    if config.train_mode:
        train = questions[:config.num_train], inputs[:config.num_train], q_lens[:config.num_train], input_lens[:config.num_train], input_masks[:config.num_train], answers[:config.num_train]
        valid = questions[config.num_train:], inputs[config.num_train:], q_lens[config.num_train:], input_lens[config.num_train:], input_masks[config.num_train:], answers[config.num_train:]
        return train, valid, word_embedding, max_q_len, max_input_len, max_mask_len, len(vocab)
    else:
        test = questions, inputs, q_lens, input_lens, input_masks, answers
        return test, word_embedding, max_q_len, max_input_len, max_mask_len, len(vocab)


================================================
FILE: dmn_plus.py
================================================

from __future__ import print_function
from __future__ import division

import sys
import time

import numpy as np
from copy import deepcopy

import tensorflow as tf
from attention_gru_cell import AttentionGRUCell

from tensorflow.contrib.cudnn_rnn.python.ops import cudnn_rnn_ops

import babi_input


class Config(object):
    """Holds model hyperparams and data information."""
    batch_size = 100
    embed_size = 80
    hidden_size = 80

    max_epochs = 256
    early_stopping = 20

    dropout = 0.9
    lr = 0.001
    l2 = 0.001

    cap_grads = False
    max_grad_val = 10
    noisy_grads = False

    word2vec_init = False
    embedding_init = np.sqrt(3)

    # NOTE not currently used, hence the nonsensical anneal_threshold
    anneal_threshold = 1000
    anneal_by = 1.5

    num_hops = 3
    num_attention_features = 4

    max_allowed_inputs = 130
    num_train = 9000

    floatX = np.float32

    babi_id = "1"
    babi_test_id = ""

    train_mode = True


def _add_gradient_noise(t, stddev=1e-3, name=None):
    """Adds gradient noise as described in http://arxiv.org/abs/1511.06807
    The input Tensor `t` should be a gradient.
    The output will be `t` + gaussian noise.
    0.001 was said to be a good fixed value for memory networks."""
    with tf.variable_scope('gradient_noise'):
        gn = tf.random_normal(tf.shape(t), stddev=stddev)
        return tf.add(t, gn)


# from https://github.com/domluna/memn2n
def _position_encoding(sentence_size, embedding_size):
    """We could have used RNN for parsing sentence but that tends to overfit.
    The simpler choice would be to take the sum of embeddings but we lose positional information.
    Position encoding is described in section 4.1 of "End to End Memory Networks" in more detail (http://arxiv.org/pdf/1503.08895v5.pdf)"""
    encoding = np.ones((embedding_size, sentence_size), dtype=np.float32)
    ls = sentence_size+1
    le = embedding_size+1
    for i in range(1, le):
        for j in range(1, ls):
            encoding[i-1, j-1] = (i - (le-1)/2) * (j - (ls-1)/2)
    encoding = 1 + 4 * encoding / embedding_size / sentence_size
    return np.transpose(encoding)


class DMN_PLUS(object):

    def load_data(self, debug=False):
        """Loads train/valid/test data and sentence encoding"""
        if self.config.train_mode:
            self.train, self.valid, self.word_embedding, self.max_q_len, self.max_sentences, self.max_sen_len, self.vocab_size = babi_input.load_babi(self.config, split_sentences=True)
        else:
            self.test, self.word_embedding, self.max_q_len, self.max_sentences, self.max_sen_len, self.vocab_size = babi_input.load_babi(self.config, split_sentences=True)
        self.encoding = _position_encoding(self.max_sen_len, self.config.embed_size)

    def add_placeholders(self):
        """add data placeholder to graph"""
        self.question_placeholder = tf.placeholder(tf.int32, shape=(self.config.batch_size, self.max_q_len))
        self.input_placeholder = tf.placeholder(tf.int32, shape=(self.config.batch_size, self.max_sentences, self.max_sen_len))

        self.question_len_placeholder = tf.placeholder(tf.int32, shape=(self.config.batch_size,))
        self.input_len_placeholder = tf.placeholder(tf.int32, shape=(self.config.batch_size,))

        self.answer_placeholder = tf.placeholder(tf.int64, shape=(self.config.batch_size,))
        self.dropout_placeholder = tf.placeholder(tf.float32)

    def get_predictions(self, output):
        preds = tf.nn.softmax(output)
        pred = tf.argmax(preds, 1)
        return pred

    def add_loss_op(self, output):
        """Calculate loss"""
        loss = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output, labels=self.answer_placeholder))

        # add l2 regularization for all variables except biases
        for v in tf.trainable_variables():
            if not 'bias' in v.name.lower():
                loss += self.config.l2*tf.nn.l2_loss(v)

        tf.summary.scalar('loss', loss)

        return loss

    def add_training_op(self, loss):
        """Calculate and apply gradients"""
        opt = tf.train.AdamOptimizer(learning_rate=self.config.lr)
        gvs = opt.compute_gradients(loss)

        # optionally cap and noise gradients to regularize
        if self.config.cap_grads:
            gvs = [(tf.clip_by_norm(grad, self.config.max_grad_val), var) for grad, var in gvs]
        if self.config.noisy_grads:
            gvs = [(_add_gradient_noise(grad), var) for grad, var in gvs]

        train_op = opt.apply_gradients(gvs)
        return train_op

    def get_question_representation(self):
        """Get question vectors via embedding and GRU"""
        questions = tf.nn.embedding_lookup(self.embeddings, self.question_placeholder)

        gru_cell = tf.contrib.rnn.GRUCell(self.config.hidden_size)
        _, q_vec = tf.nn.dynamic_rnn(gru_cell,
                                     questions,
                                     dtype=np.float32,
                                     sequence_length=self.question_len_placeholder)

        return q_vec

    def get_input_representation(self):
        """Get fact (sentence) vectors via embedding, positional encoding and bi-directional GRU"""
        # get word vectors from embedding
        inputs = tf.nn.embedding_lookup(self.embeddings, self.input_placeholder)

        # use encoding to get sentence representation
        inputs = tf.reduce_sum(inputs * self.encoding, 2)

        forward_gru_cell = tf.contrib.rnn.GRUCell(self.config.hidden_size)
        backward_gru_cell = tf.contrib.rnn.GRUCell(self.config.hidden_size)

        outputs, _ = tf.nn.bidirectional_dynamic_rnn(
            forward_gru_cell,
            backward_gru_cell,
            inputs,
            dtype=np.float32,
            sequence_length=self.input_len_placeholder)

        # sum forward and backward output vectors
        fact_vecs = tf.reduce_sum(tf.stack(outputs), axis=0)

        # apply dropout
        fact_vecs = tf.nn.dropout(fact_vecs, self.dropout_placeholder)

        return fact_vecs

    def get_attention(self, q_vec, prev_memory, fact_vec, reuse):
        """Use question vector and previous memory to create scalar attention for current fact"""
        with tf.variable_scope("attention", reuse=reuse):

            features = [fact_vec*q_vec,
                        fact_vec*prev_memory,
                        tf.abs(fact_vec - q_vec),
                        tf.abs(fact_vec - prev_memory)]

            feature_vec = tf.concat(features, 1)

            attention = tf.contrib.layers.fully_connected(feature_vec,
                                                          self.config.embed_size,
                                                          activation_fn=tf.nn.tanh,
                                                          reuse=reuse, scope="fc1")

            attention = tf.contrib.layers.fully_connected(attention,
                                                          1,
                                                          activation_fn=None,
                                                          reuse=reuse, scope="fc2")

        return attention

    def generate_episode(self, memory, q_vec, fact_vecs, hop_index):
        """Generate episode by applying attention to current fact vectors through a modified GRU"""

        attentions = [tf.squeeze(
            self.get_attention(q_vec, memory, fv, bool(hop_index) or bool(i)), axis=1)
            for i, fv in enumerate(tf.unstack(fact_vecs, axis=1))]

        attentions = tf.transpose(tf.stack(attentions))
        self.attentions.append(attentions)
        attentions = tf.nn.softmax(attentions)
        attentions = tf.expand_dims(attentions, axis=-1)

        reuse = True if hop_index > 0 else False

        # concatenate fact vectors and attentions for input into attGRU
        gru_inputs = tf.concat([fact_vecs, attentions], 2)

        with tf.variable_scope('attention_gru', reuse=reuse):
            _, episode = tf.nn.dynamic_rnn(AttentionGRUCell(self.config.hidden_size),
                                           gru_inputs,
                                           dtype=np.float32,
                                           sequence_length=self.input_len_placeholder)

        return episode

    def add_answer_module(self, rnn_output, q_vec):
        """Linear softmax answer module"""

        rnn_output = tf.nn.dropout(rnn_output, self.dropout_placeholder)

        output = tf.layers.dense(tf.concat([rnn_output, q_vec], 1),
                                 self.vocab_size,
                                 activation=None)

        return output

    def inference(self):
        """Performs inference on the DMN model"""

        # input fusion module
        with tf.variable_scope("question", initializer=tf.contrib.layers.xavier_initializer()):
            print('==> get question representation')
            q_vec = self.get_question_representation()

        with tf.variable_scope("input", initializer=tf.contrib.layers.xavier_initializer()):
            print('==> get input representation')
            fact_vecs = self.get_input_representation()

        # keep track of attentions for possible strong supervision
        self.attentions = []

        # memory module
        with tf.variable_scope("memory", initializer=tf.contrib.layers.xavier_initializer()):
            print('==> build episodic memory')

            # generate n_hops episodes
            prev_memory = q_vec

            for i in range(self.config.num_hops):
                # get a new episode
                print('==> generating episode', i)
                episode = self.generate_episode(prev_memory, q_vec, fact_vecs, i)

                # untied weights for memory update
                with tf.variable_scope("hop_%d" % i):
                    prev_memory = tf.layers.dense(tf.concat([prev_memory, episode, q_vec], 1),
                                                  self.config.hidden_size,
                                                  activation=tf.nn.relu)

            output = prev_memory

        # pass memory module output through linear answer module
        with tf.variable_scope("answer", initializer=tf.contrib.layers.xavier_initializer()):
            output = self.add_answer_module(output, q_vec)

        return output

    def run_epoch(self, session, data, num_epoch=0, train_writer=None, train_op=None, verbose=2, train=False):
        config = self.config
        dp = config.dropout
        if train_op is None:
            # train_op = tf.no_op()
            dp = 1

        total_steps = len(data[0]) // config.batch_size
        total_loss = []
        accuracy = 0

        # shuffle data
        p = np.random.permutation(len(data[0]))
        qp, ip, ql, il, im, a = data
        qp, ip, ql, il, im, a = qp[p], ip[p], ql[p], il[p], im[p], a[p]

        for step in range(total_steps):
            index = range(step*config.batch_size, (step+1)*config.batch_size)
            feed = {self.question_placeholder: qp[index],
                    self.input_placeholder: ip[index],
                    self.question_len_placeholder: ql[index],
                    self.input_len_placeholder: il[index],
                    self.answer_placeholder: a[index],
                    self.dropout_placeholder: dp}
            if train_op is None:
                loss, pred, summary = session.run(
                    [self.calculate_loss, self.pred, self.merged], feed_dict=feed)
            else:
                loss, pred, summary, _ = session.run(
                    [self.calculate_loss, self.pred, self.merged, train_op], feed_dict=feed)

            if train_writer is not None:
                train_writer.add_summary(summary, num_epoch*total_steps + step)

            answers = a[step*config.batch_size:(step+1)*config.batch_size]
            accuracy += np.sum(pred == answers)/float(len(answers))

            total_loss.append(loss)
            if verbose and step % verbose == 0:
                sys.stdout.write('\r{} / {} : loss = {}'.format(
                    step, total_steps, np.mean(total_loss)))
                sys.stdout.flush()

        if verbose:
            sys.stdout.write('\r')

        return np.mean(total_loss), accuracy/float(total_steps)

    def __init__(self, config):
        self.config = config
        self.variables_to_save = {}
        self.load_data(debug=False)
        self.add_placeholders()

        # set up embedding
        self.embeddings = tf.Variable(self.word_embedding.astype(np.float32), name="Embedding")

        self.output = self.inference()
        self.pred = self.get_predictions(self.output)
        self.calculate_loss = self.add_loss_op(self.output)
        self.train_step = self.add_training_op(self.calculate_loss)
        self.merged = tf.summary.merge_all()


================================================
FILE: dmn_test.py
================================================

from __future__ import print_function
from __future__ import division

import tensorflow as tf
import numpy as np

import time
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-b", "--babi_task_id", help="specify babi task 1-20 (default=1)")
parser.add_argument("-t", "--dmn_type", help="specify type of dmn (default=plus)")
args = parser.parse_args()

dmn_type = args.dmn_type if args.dmn_type is not None else "plus"

if dmn_type == "original":
    from dmn_original import Config
    config = Config()
elif dmn_type == "plus":
    from dmn_plus import Config
    config = Config()
else:
    raise NotImplementedError(dmn_type + ' DMN type is not currently implemented')

if args.babi_task_id is not None:
    config.babi_id = args.babi_task_id
config.strong_supervision = False
config.train_mode = False

print('Testing DMN ' + dmn_type + ' on babi task', config.babi_id)

# create model
with tf.variable_scope('DMN') as scope:
    if dmn_type == "original":
        from dmn_original import DMN
        model = DMN(config)
    elif dmn_type == "plus":
        from dmn_plus import DMN_PLUS
        model = DMN_PLUS(config)

print('==> initializing variables')
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as session:
    session.run(init)

    print('==> restoring weights')
    saver.restore(session, 'weights/task' + str(model.config.babi_id) + '.weights')

    print('==> running DMN')
    test_loss, test_accuracy = model.run_epoch(session, model.test)

    print('')
    print('Test accuracy:', test_accuracy)


================================================
FILE: dmn_train.py
================================================

from __future__ import print_function
from __future__ import division

import tensorflow as tf
import time
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("-b", "--babi_task_id", help="specify babi task 1-20 (default=1)")
parser.add_argument("-r", "--restore", help="restore previously trained weights (default=false)")
parser.add_argument("-s", "--strong_supervision", help="use labelled supporting facts (default=false)")
parser.add_argument("-t", "--dmn_type", help="specify type of dmn (default=plus)")
parser.add_argument("-l", "--l2_loss", type=float, default=0.001, help="specify l2 loss constant")
parser.add_argument("-n", "--num_runs", type=int, help="specify the number of model runs")

args = parser.parse_args()

dmn_type = args.dmn_type if args.dmn_type is not None else "plus"

if dmn_type == "plus":
    from dmn_plus import Config
    config = Config()
else:
    raise NotImplementedError(dmn_type + ' DMN type is not currently implemented')

config.babi_id = args.babi_task_id if args.babi_task_id is not None else str(1)
config.l2 = args.l2_loss if args.l2_loss is not None else 0.001
config.strong_supervision = args.strong_supervision if args.strong_supervision is not None else False

num_runs = args.num_runs if args.num_runs is not None else 1

print('Training DMN ' + dmn_type + ' on babi task', config.babi_id)

best_overall_val_loss = float('inf')

# create model
with tf.variable_scope('DMN') as scope:
    if dmn_type == "plus":
        from dmn_plus import DMN_PLUS
        model = DMN_PLUS(config)

for run in range(num_runs):

    print('Starting run', run)

    print('==> initializing variables')
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

    with tf.Session() as session:

        sum_dir = 'summaries/train/' + time.strftime("%Y-%m-%d %H %M")
        if not os.path.exists(sum_dir):
            os.makedirs(sum_dir)
        train_writer = tf.summary.FileWriter(sum_dir, session.graph)

        session.run(init)

        best_val_epoch = 0
        prev_epoch_loss = float('inf')
        best_val_loss = float('inf')
        best_val_accuracy = 0.0

        if args.restore:
            print('==> restoring weights')
            saver.restore(session, 'weights/task' + str(model.config.babi_id) + '.weights')

        print('==> starting training')
        for epoch in range(config.max_epochs):
            print('Epoch {}'.format(epoch))
            start = time.time()

            train_loss, train_accuracy = model.run_epoch(
                session, model.train, epoch, train_writer,
                train_op=model.train_step, train=True)
            valid_loss, valid_accuracy = model.run_epoch(session, model.valid)
            print('Training loss: {}'.format(train_loss))
            print('Validation loss: {}'.format(valid_loss))
            print('Training accuracy: {}'.format(train_accuracy))
            print('Validation accuracy: {}'.format(valid_accuracy))

            if valid_loss < best_val_loss:
                best_val_loss = valid_loss
                best_val_epoch = epoch
                if best_val_loss < best_overall_val_loss:
                    print('Saving weights')
                    best_overall_val_loss = best_val_loss
                    best_val_accuracy = valid_accuracy
                    saver.save(session, 'weights/task' + str(model.config.babi_id) + '.weights')

            # anneal
            if train_loss > prev_epoch_loss * model.config.anneal_threshold:
                model.config.lr /= model.config.anneal_by
                print('annealed lr to %f' % model.config.lr)

            prev_epoch_loss = train_loss

            if epoch - best_val_epoch > config.early_stopping:
                break
            print('Total time: {}'.format(time.time() - start))

        print('Best validation accuracy:', best_val_accuracy)


================================================
FILE: fetch_babi_data.sh
================================================

#!/bin/bash

url=http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz
fname=`basename $url`

curl -SLO $url
tar zxvf $fname

mkdir -p data
mv tasks_1-20_v1-2/* data/

rm -r tasks_1-20_v1-2
rm tasks_1-20_v1-2.tar.gz

mkdir weights
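
As a quick sanity check of the sentence encoder, the position-encoding scheme used by `_position_encoding` in `dmn_plus.py` can be reproduced in plain NumPy. This is an illustrative sketch, not part of the repository: the standalone function below mirrors the repository's implementation (itself taken from https://github.com/domluna/memn2n) so its output can be inspected without building the TensorFlow graph.

```python
import numpy as np

def position_encoding(sentence_size, embedding_size):
    """Position-encoding weights from section 4.1 of "End-To-End Memory
    Networks" (http://arxiv.org/pdf/1503.08895v5.pdf); mirrors
    _position_encoding in dmn_plus.py."""
    encoding = np.ones((embedding_size, sentence_size), dtype=np.float32)
    ls = sentence_size + 1
    le = embedding_size + 1
    for i in range(1, le):
        for j in range(1, ls):
            # weight depends on both word position j and embedding dimension i
            encoding[i - 1, j - 1] = (i - (le - 1) / 2) * (j - (ls - 1) / 2)
    encoding = 1 + 4 * encoding / embedding_size / sentence_size
    # transpose so a sentence of word vectors (sentence_size, embedding_size)
    # can be weighted element-wise before summing over the word axis
    return np.transpose(encoding)

enc = position_encoding(6, 80)
print(enc.shape)  # (6, 80)
```

In `get_input_representation`, each fact sentence's word embeddings are multiplied element-wise by these weights and summed over the word axis, which preserves word-order information that a plain bag-of-words sum would discard.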