Repository: HKUST-KnowComp/MnemonicReader
Branch: master
Commit: 76aeb1d9021e
Files: 19
Total size: 144.2 KB

Directory structure:
gitextract_qftjbr90/
├── .gitignore
├── LICENSE
├── README.md
├── config.py
├── data.py
├── layers.py
├── m_reader.py
├── model.py
├── predictor.py
├── r_net.py
├── rnn_reader.py
├── script/
│   ├── evaluate-v1.1.py
│   ├── interactive.py
│   ├── predict.py
│   ├── preprocess.py
│   └── train.py
├── spacy_tokenizer.py
├── utils.py
└── vector.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.pyc
*.DS_Store
*~
data/
*.tar.gz
*.egg-info

================================================
FILE: LICENSE
================================================
BSD 3-Clause License

Copyright (c) 2018, HKUST-KnowComp
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

================================================
FILE: README.md
================================================
# Mnemonic Reader

The Mnemonic Reader is a deep learning model for the Machine Comprehension task. Details are in this [paper](https://arxiv.org/pdf/1705.02798.pdf). It combines the advantages of [match-LSTM](https://arxiv.org/pdf/1608.07905), [R-Net](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf) and [Document Reader](https://arxiv.org/abs/1704.00051) and uses a new unit, the Semantic Fusion Unit (SFU), to achieve state-of-the-art results (at that time).

This repository is a [PyTorch](http://pytorch.org/) implementation of the Mnemonic Reader. PyTorch implementations of R-Net and Document Reader are also included for comparison with the Mnemonic Reader. Pretrained models are available in [release](https://github.com/HKUST-KnowComp/MnemonicReader/releases).

This repo belongs to [HKUST-KnowComp](https://github.com/HKUST-KnowComp) and is under the [BSD LICENSE](LICENSE).

Some of the code is based on [DrQA](https://github.com/facebookresearch/DrQA).

Please feel free to contact Xin Liu (xliucr@connect.ust.hk) if you have any questions about this repo.
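For readers who want a feel for the Semantic Fusion Unit before opening `layers.py`, here is a minimal pure-Python sketch of the gated fusion that the `SFU` module there computes: `r = tanh(W_r [x; f])`, `g = sigmoid(W_g [x; f])`, `o = g * r + (1 - g) * x`. It is an illustration only, not the repository's implementation: plain lists stand in for tensors, and the helper names (`sfu`, `rand_matrix`), the toy sizes and the random weights are all assumptions made for this example.

```python
import math
import random


def sfu(x, fusion, w_r, b_r, w_g, b_g):
    """Fuse an input vector x with a fusion vector (illustrative sketch).

    r = tanh(W_r [x; fusion] + b_r)      candidate update
    g = sigmoid(W_g [x; fusion] + b_g)   gate with values in (0, 1)
    o = g * r + (1 - g) * x              gated mix of update and input
    """
    concat = x + fusion  # list concatenation, i.e. [x; fusion]

    def affine(weights, bias):
        # One output per weight row: dot(row, concat) + bias term.
        return [sum(w * v for w, v in zip(row, concat)) + b
                for row, b in zip(weights, bias)]

    r = [math.tanh(v) for v in affine(w_r, b_r)]
    g = [1.0 / (1.0 + math.exp(-v)) for v in affine(w_g, b_g)]
    return [gi * ri + (1.0 - gi) * xi for gi, ri, xi in zip(g, r, x)]


def rand_matrix(rows, cols):
    """Hypothetical helper: random weights in [-1, 1] for the toy example."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes and values, chosen only for illustration.
random.seed(0)
input_size, fusion_size = 3, 4
x = [0.5, -0.2, 0.1]
fusion = [0.3, 0.0, -0.4, 0.9]
w_r = rand_matrix(input_size, input_size + fusion_size)
w_g = rand_matrix(input_size, input_size + fusion_size)
zeros = [0.0] * input_size

out = sfu(x, fusion, w_r, zeros, w_g, zeros)
print(out)  # a vector with the same length as x

# With a strongly negative gate bias the gate closes and the input passes through.
passthrough = sfu(x, fusion, w_r, zeros, w_g, [-50.0] * input_size)
print(passthrough)  # numerically very close to x
```

The gate is what lets the output keep part of the input unchanged: where `g` is near 0 the unit copies `x`, and where it is near 1 it takes the fused candidate `r`.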

### Evaluation on SQuAD

| Model                                 | DEV_EM | DEV_F1 |
| ------------------------------------- | ------ | ------ |
| Document Reader (original paper)      | 69.5   | 78.8   |
| Document Reader (trained model)       | 69.4   | 78.6   |
| R-Net (original paper 1)              | 71.1   | 79.5   |
| R-Net (original paper 2)              | 72.3   | 80.6   |
| R-Net (trained model)                 | 70.2   | 79.4   |
| Mnemonic Reader (original paper)      | 71.8   | 81.2   |
| Mnemonic Reader + RL (original paper) | 72.1   | 81.6   |
| Mnemonic Reader (trained model)       | 73.2   | 81.5   |

![EM_F1](img/EM_F1.png)

### Requirements

* Python >= 3.4
* PyTorch >= 0.31
* spaCy >= 2.0.0
* tqdm
* ujson
* numpy
* prettytable

### Prepare

First, download the dataset and the pre-trained word vectors.

```bash
mkdir -p data/datasets
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -O data/datasets/SQuAD-train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -O data/datasets/SQuAD-dev-v1.1.json
```

```bash
mkdir -p data/embeddings
wget http://nlp.stanford.edu/data/glove.840B.300d.zip -O data/embeddings/glove.840B.300d.zip
cd data/embeddings
unzip glove.840B.300d.zip
```

Then preprocess the data.

```bash
python script/preprocess.py data/datasets data/datasets --split SQuAD-train-v1.1
python script/preprocess.py data/datasets data/datasets --split SQuAD-dev-v1.1
```

To use multiple cores and speed this up, add `--num-workers 4` to the commands.

### Train

Several parameters can be set, but all of them have default values. If you are not interested in tuning them, just run:

```bash
python script/train.py
```

After several hours, the model is saved in `data/models/` (e.g. `20180416-acc9d06d.mdl`), along with its log file (e.g. `20180416-acc9d06d.txt`).

### Predict

To evaluate the trained model, first generate predictions:

```bash
python script/predict.py --model data/models/20180416-acc9d06d.mdl
```

Replace the model name in the command above with your own.

This does not print the scores directly. To get them, run the official `evaluate-v1.1.py` in `script/` on the predictions:

```bash
python script/evaluate-v1.1.py data/predict/SQuAD-dev-v1.1-20180416-acc9d06d.preds data/datasets/SQuAD-dev-v1.1.json
```

### Interactive

For those interested in QA systems, `script/interactive.py` provides a simple demo.

```bash
python script/interactive.py --model data/models/20180416-acc9d06d.mdl
```

You will then drop into an interactive session. It looks like:

```
* Interactive Module *

* Repo: Mnemonic Reader (https://github.com/HKUST-KnowComp/MnemonicReader)

* Implement based on Facebook's DrQA

>>> process(document, question, candidates=None, top_n=1)
>>> usage()

>>> text="Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary."
>>> question = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"
>>> process(text, question)
+------+----------------------------+-----------+
| Rank |            Span            |   Score   |
+------+----------------------------+-----------+
|  1   | Saint Bernadette Soubirous | 0.9875301 |
+------+----------------------------+-----------+
```

### More parameters

To tune parameters for a higher score, see the help output of each script:

```bash
python script/preprocess.py --help
```

```bash
python script/train.py --help
```

```bash
python script/predict.py --help
```

```bash
python script/interactive.py --help
```

## License

All code in **Mnemonic Reader** is under the [BSD LICENSE](LICENSE).

================================================
FILE: config.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Model architecture/optimization options for WRMCQA document reader."""

import argparse
import logging

logger = logging.getLogger(__name__)

# Index of arguments concerning the core model architecture
MODEL_ARCHITECTURE = {
    'model_type', 'embedding_dim', 'char_embedding_dim', 'hidden_size',
    'char_hidden_size', 'doc_layers', 'question_layers', 'rnn_type',
    'concat_rnn_layers', 'question_merge', 'use_qemb', 'use_exact_match',
    'use_pos', 'use_ner', 'use_lemma', 'use_tf', 'hop'
}

# Index of arguments concerning the model optimizer/training
MODEL_OPTIMIZER = {
    'fix_embeddings', 'optimizer', 'learning_rate', 'momentum',
    'weight_decay', 'rho', 'eps', 'max_len', 'grad_clipping', 'tune_partial',
    'rnn_padding', 'dropout_rnn', 'dropout_rnn_output', 'dropout_emb'
}


def str2bool(v):
    return v.lower() in ('yes', 'true', 't', '1', 'y')


def add_model_args(parser):
    parser.register('type', 'bool', str2bool)

    # Model architecture
    model = parser.add_argument_group('Reader Model Architecture')
    model.add_argument('--model-type', type=str,
default='mnemonic', help='Model architecture type: rnn, r_net, mnemonic') model.add_argument('--embedding-dim', type=int, default=300, help='Embedding size if embedding_file is not given') model.add_argument('--char-embedding-dim', type=int, default=50, help='Embedding size if char_embedding_file is not given') model.add_argument('--hidden-size', type=int, default=100, help='Hidden size of RNN units') model.add_argument('--char-hidden-size', type=int, default=50, help='Hidden size of char RNN units') model.add_argument('--doc-layers', type=int, default=3, help='Number of encoding layers for document') model.add_argument('--question-layers', type=int, default=3, help='Number of encoding layers for question') model.add_argument('--rnn-type', type=str, default='lstm', help='RNN type: LSTM, GRU, or RNN') # Model specific details detail = parser.add_argument_group('Reader Model Details') detail.add_argument('--concat-rnn-layers', type='bool', default=True, help='Combine hidden states from each encoding layer') detail.add_argument('--question-merge', type=str, default='self_attn', help='The way of computing the question representation') detail.add_argument('--use-qemb', type='bool', default=True, help='Whether to use weighted question embeddings') detail.add_argument('--use-exact-match', type='bool', default=True, help='Whether to use in_question_* features') detail.add_argument('--use-pos', type='bool', default=True, help='Whether to use pos features') detail.add_argument('--use-ner', type='bool', default=True, help='Whether to use ner features') detail.add_argument('--use-lemma', type='bool', default=True, help='Whether to use lemma features') detail.add_argument('--use-tf', type='bool', default=True, help='Whether to use term frequency features') detail.add_argument('--hop', type=int, default=2, help='The number of hops for both aligner and the answer pointer in m-reader') # Optimization details optim = parser.add_argument_group('Reader Optimization') 
optim.add_argument('--dropout-emb', type=float, default=0.2, help='Dropout rate for word embeddings') optim.add_argument('--dropout-rnn', type=float, default=0.2, help='Dropout rate for RNN states') optim.add_argument('--dropout-rnn-output', type='bool', default=True, help='Whether to dropout the RNN output') optim.add_argument('--optimizer', type=str, default='adamax', help='Optimizer: sgd, adamax, adadelta') optim.add_argument('--learning-rate', type=float, default=1.0, help='Learning rate for sgd, adadelta') optim.add_argument('--grad-clipping', type=float, default=10, help='Gradient clipping') optim.add_argument('--weight-decay', type=float, default=0, help='Weight decay factor') optim.add_argument('--momentum', type=float, default=0, help='Momentum factor') optim.add_argument('--rho', type=float, default=0.95, help='Rho for adadelta') optim.add_argument('--eps', type=float, default=1e-6, help='Eps for adadelta') optim.add_argument('--fix-embeddings', type='bool', default=True, help='Keep word embeddings fixed (use pretrained)') optim.add_argument('--tune-partial', type=int, default=0, help='Backprop through only the top N question words') optim.add_argument('--rnn-padding', type='bool', default=False, help='Explicitly account for padding in RNN encoding') optim.add_argument('--max-len', type=int, default=15, help='The max span allowed during decoding') def get_model_args(args): """Filter args for model ones. From a args Namespace, return a new Namespace with *only* the args specific to the model architecture or optimization. (i.e. the ones defined here.) """ global MODEL_ARCHITECTURE, MODEL_OPTIMIZER required_args = MODEL_ARCHITECTURE | MODEL_OPTIMIZER arg_values = {k: v for k, v in vars(args).items() if k in required_args} return argparse.Namespace(**arg_values) def override_model_args(old_args, new_args): """Set args to new parameters. Decide which model args to keep and which to override when resolving a set of saved args and new args. 
    We keep the new optimization, but leave the model architecture alone.
    """
    global MODEL_OPTIMIZER
    old_args, new_args = vars(old_args), vars(new_args)
    for k in old_args.keys():
        if k in new_args and old_args[k] != new_args[k]:
            if k in MODEL_OPTIMIZER:
                logger.info('Overriding saved %s: %s --> %s' %
                            (k, old_args[k], new_args[k]))
                old_args[k] = new_args[k]
            else:
                logger.info('Keeping saved %s: %s' % (k, old_args[k]))
    return argparse.Namespace(**old_args)

================================================
FILE: data.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Data processing/loading helpers."""

import numpy as np
import logging
import unicodedata

from torch.utils.data import Dataset
from torch.utils.data.sampler import Sampler
from vector import vectorize

logger = logging.getLogger(__name__)

# ------------------------------------------------------------------------------
# Dictionary class for tokens.
# ------------------------------------------------------------------------------


class Dictionary(object):
    # Special tokens: index 0 is padding, index 1 is out-of-vocabulary.
    NULL = '<NULL>'
    UNK = '<UNK>'
    START = 2

    @staticmethod
    def normalize(token):
        return unicodedata.normalize('NFD', token)

    def __init__(self):
        self.tok2ind = {self.NULL: 0, self.UNK: 1}
        self.ind2tok = {0: self.NULL, 1: self.UNK}

    def __len__(self):
        return len(self.tok2ind)

    def __iter__(self):
        return iter(self.tok2ind)

    def __contains__(self, key):
        if type(key) == int:
            return key in self.ind2tok
        elif type(key) == str:
            return self.normalize(key) in self.tok2ind

    def __getitem__(self, key):
        # Unknown indices map to the UNK token, unknown tokens to the UNK index.
        if type(key) == int:
            return self.ind2tok.get(key, self.UNK)
        if type(key) == str:
            return self.tok2ind.get(self.normalize(key),
                                    self.tok2ind.get(self.UNK))

    def __setitem__(self, key, item):
        if type(key) == int and type(item) == str:
            self.ind2tok[key] = item
        elif type(key) == str and type(item) == int:
            self.tok2ind[key] = item
        else:
            raise RuntimeError('Invalid (key, item) types.')

    def add(self, token):
        token = self.normalize(token)
        if token not in self.tok2ind:
            index = len(self.tok2ind)
            self.tok2ind[token] = index
            self.ind2tok[index] = token

    def tokens(self):
        """Get dictionary tokens.

        Return all the words indexed by this dictionary, except for special
        tokens.
        """
        tokens = [k for k in self.tok2ind.keys()
                  if k not in {'<NULL>', '<UNK>'}]
        return tokens


# ------------------------------------------------------------------------------
# PyTorch dataset class for SQuAD (and SQuAD-like) data.
# ------------------------------------------------------------------------------ class ReaderDataset(Dataset): def __init__(self, examples, model, single_answer=False): self.model = model self.examples = examples self.single_answer = single_answer def __len__(self): return len(self.examples) def __getitem__(self, index): return vectorize(self.examples[index], self.model, self.single_answer) def lengths(self): return [(len(ex['document']), len(ex['question'])) for ex in self.examples] # ------------------------------------------------------------------------------ # PyTorch sampler returning batched of sorted lengths (by doc and question). # ------------------------------------------------------------------------------ class SortedBatchSampler(Sampler): def __init__(self, lengths, batch_size, shuffle=True): self.lengths = lengths self.batch_size = batch_size self.shuffle = shuffle def __iter__(self): lengths = np.array( [(-l[0], -l[1], np.random.random()) for l in self.lengths], dtype=[('l1', np.int_), ('l2', np.int_), ('rand', np.float_)] ) indices = np.argsort(lengths, order=('l1', 'l2', 'rand')) batches = [indices[i:i + self.batch_size] for i in range(0, len(indices), self.batch_size)] if self.shuffle: np.random.shuffle(batches) return iter([i for batch in batches for i in batch]) def __len__(self): return len(self.lengths) ================================================ FILE: layers.py ================================================ #!/usr/bin/env python3 # Copyright 2018-present, HKUST-KnowComp. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. 
"""Definitions of model layers/NN modules""" import torch import torch.nn as nn import torch.nn.functional as F from torch.autograd import Variable import math import random # ------------------------------------------------------------------------------ # Modules # ------------------------------------------------------------------------------ class StackedBRNN(nn.Module): """Stacked Bi-directional RNNs. Differs from standard PyTorch library in that it has the option to save and concat the hidden states between layers. (i.e. the output hidden size for each sequence input is num_layers * hidden_size). """ def __init__(self, input_size, hidden_size, num_layers, dropout_rate=0, dropout_output=False, rnn_type=nn.LSTM, concat_layers=False, padding=False): super(StackedBRNN, self).__init__() self.padding = padding self.dropout_output = dropout_output self.dropout_rate = dropout_rate self.num_layers = num_layers self.concat_layers = concat_layers self.rnns = nn.ModuleList() for i in range(num_layers): input_size = input_size if i == 0 else 2 * hidden_size self.rnns.append(rnn_type(input_size, hidden_size, num_layers=1, bidirectional=True)) def forward(self, x, x_mask): """Encode either padded or non-padded sequences. Can choose to either handle or ignore variable length sequences. Always handle padding in eval. Args: x: batch * len * hdim x_mask: batch * len (1 for padding, 0 for true) Output: x_encoded: batch * len * hdim_encoded """ if x_mask.data.sum() == 0 or x_mask.data.eq(1).long().sum(1).min() == 0: # No padding necessary. output = self._forward_unpadded(x, x_mask) elif self.padding or not self.training: # Pad if we care or if its during eval. output = self._forward_padded(x, x_mask) else: # We don't care. 
output = self._forward_unpadded(x, x_mask) return output.contiguous() def _forward_unpadded(self, x, x_mask): """Faster encoding that ignores any padding.""" # Transpose batch and sequence dims x = x.transpose(0, 1) # Encode all layers outputs = [x] for i in range(self.num_layers): rnn_input = outputs[-1] # Apply dropout to hidden input if self.dropout_rate > 0: rnn_input = F.dropout(rnn_input, p=self.dropout_rate, training=self.training) # Forward rnn_output = self.rnns[i](rnn_input)[0] outputs.append(rnn_output) # Concat hidden layers if self.concat_layers: output = torch.cat(outputs[1:], 2) else: output = outputs[-1] # Transpose back output = output.transpose(0, 1) # Dropout on output layer if self.dropout_output and self.dropout_rate > 0: output = F.dropout(output, p=self.dropout_rate, training=self.training) return output def _forward_padded(self, x, x_mask): """Slower (significantly), but more precise, encoding that handles padding. """ # Compute sorted sequence lengths lengths = x_mask.data.eq(0).long().sum(1).squeeze() _, idx_sort = torch.sort(lengths, dim=0, descending=True) _, idx_unsort = torch.sort(idx_sort, dim=0) lengths = list(lengths[idx_sort]) idx_sort = Variable(idx_sort) idx_unsort = Variable(idx_unsort) # Sort x x = x.index_select(0, idx_sort) # Transpose batch and sequence dims x = x.transpose(0, 1) # Pack it up rnn_input = nn.utils.rnn.pack_padded_sequence(x, lengths) # Encode all layers outputs = [rnn_input] for i in range(self.num_layers): rnn_input = outputs[-1] # Apply dropout to input if self.dropout_rate > 0: dropout_input = F.dropout(rnn_input.data, p=self.dropout_rate, training=self.training) rnn_input = nn.utils.rnn.PackedSequence(dropout_input, rnn_input.batch_sizes) outputs.append(self.rnns[i](rnn_input)[0]) # Unpack everything for i, o in enumerate(outputs[1:], 1): outputs[i] = nn.utils.rnn.pad_packed_sequence(o)[0] # Concat hidden layers or take final if self.concat_layers: output = torch.cat(outputs[1:], 2) else: output = 
outputs[-1] # Transpose and unsort output = output.transpose(0, 1) output = output.index_select(0, idx_unsort) # Pad up to original batch sequence length if output.size(1) != x_mask.size(1): padding = torch.zeros(output.size(0), x_mask.size(1) - output.size(1), output.size(2)).type(output.data.type()) output = torch.cat([output, Variable(padding)], 1) # Dropout on output layer if self.dropout_output and self.dropout_rate > 0: output = F.dropout(output, p=self.dropout_rate, training=self.training) return output class FeedForwardNetwork(nn.Module): def __init__(self, input_size, hidden_size, output_size, dropout_rate=0): super(FeedForwardNetwork, self).__init__() self.dropout_rate = dropout_rate self.linear1 = nn.Linear(input_size, hidden_size) self.linear2 = nn.Linear(hidden_size, output_size) def forward(self, x): x_proj = F.dropout(F.relu(self.linear1(x)), p=self.dropout_rate, training=self.training) x_proj = self.linear2(x_proj) return x_proj class PointerNetwork(nn.Module): def __init__(self, x_size, y_size, hidden_size, dropout_rate=0, cell_type=nn.GRUCell, normalize=True): super(PointerNetwork, self).__init__() self.normalize = normalize self.hidden_size = hidden_size self.dropout_rate = dropout_rate self.linear = nn.Linear(x_size+y_size, hidden_size, bias=False) self.weights = nn.Linear(hidden_size, 1, bias=False) self.self_attn = NonLinearSeqAttn(y_size, hidden_size) self.cell = cell_type(x_size, y_size) def init_hiddens(self, y, y_mask): attn = self.self_attn(y, y_mask) res = attn.unsqueeze(1).bmm(y).squeeze(1) # [B, I] return res def pointer(self, x, state, x_mask): x_ = torch.cat([x, state.unsqueeze(1).repeat(1,x.size(1),1)], 2) s0 = F.tanh(self.linear(x_)) s = self.weights(s0).view(x.size(0), x.size(1)) s.data.masked_fill_(x_mask.data, -float('inf')) a = F.softmax(s) res = a.unsqueeze(1).bmm(x).squeeze(1) if self.normalize: if self.training: # In training we output log-softmax for NLL scores = F.log_softmax(s) else: # ...Otherwise 0-1 probabilities 
scores = F.softmax(s) else: scores = a.exp() return res, scores def forward(self, x, y, x_mask, y_mask): hiddens = self.init_hiddens(y, y_mask) c, start_scores = self.pointer(x, hiddens, x_mask) c_ = F.dropout(c, p=self.dropout_rate, training=self.training) hiddens = self.cell(c_, hiddens) c, end_scores = self.pointer(x, hiddens, x_mask) return start_scores, end_scores class MemoryAnsPointer(nn.Module): def __init__(self, x_size, y_size, hidden_size, hop=1, dropout_rate=0, normalize=True): super(MemoryAnsPointer, self).__init__() self.normalize = normalize self.hidden_size = hidden_size self.hop = hop self.dropout_rate = dropout_rate self.FFNs_start = nn.ModuleList() self.SFUs_start = nn.ModuleList() self.FFNs_end = nn.ModuleList() self.SFUs_end = nn.ModuleList() for i in range(self.hop): self.FFNs_start.append(FeedForwardNetwork(x_size+y_size+2*hidden_size, hidden_size, 1, dropout_rate)) self.SFUs_start.append(SFU(y_size, 2*hidden_size)) self.FFNs_end.append(FeedForwardNetwork(x_size+y_size+2*hidden_size, hidden_size, 1, dropout_rate)) self.SFUs_end.append(SFU(y_size, 2*hidden_size)) def forward(self, x, y, x_mask, y_mask): z_s = y[:,-1,:].unsqueeze(1) # [B, 1, I] z_e = None s = None e = None p_s = None p_e = None for i in range(self.hop): z_s_ = z_s.repeat(1,x.size(1),1) # [B, S, I] s = self.FFNs_start[i](torch.cat([x, z_s_, x*z_s_], 2)).squeeze(2) s.data.masked_fill_(x_mask.data, -float('inf')) p_s = F.softmax(s, dim=1) # [B, S] u_s = p_s.unsqueeze(1).bmm(x) # [B, 1, I] z_e = self.SFUs_start[i](z_s, u_s) # [B, 1, I] z_e_ = z_e.repeat(1,x.size(1),1) # [B, S, I] e = self.FFNs_end[i](torch.cat([x, z_e_, x*z_e_], 2)).squeeze(2) e.data.masked_fill_(x_mask.data, -float('inf')) p_e = F.softmax(e, dim=1) # [B, S] u_e = p_e.unsqueeze(1).bmm(x) # [B, 1, I] z_s = self.SFUs_end[i](z_e, u_e) if self.normalize: if self.training: # In training we output log-softmax for NLL p_s = F.log_softmax(s, dim=1) # [B, S] p_e = F.log_softmax(e, dim=1) # [B, S] else: # ...Otherwise 0-1 
probabilities p_s = F.softmax(s, dim=1) # [B, S] p_e = F.softmax(e, dim=1) # [B, S] else: p_s = s.exp() p_e = e.exp() return p_s, p_e # ------------------------------------------------------------------------------ # Attentions # ------------------------------------------------------------------------------ class SeqAttnMatch(nn.Module): """Given sequences X and Y, match sequence Y to each element in X. * o_i = sum(alpha_j * y_j) for i in X * alpha_j = softmax(y_j * x_i) """ def __init__(self, input_size, identity=False): super(SeqAttnMatch, self).__init__() if not identity: self.linear = nn.Linear(input_size, input_size) else: self.linear = None def forward(self, x, y, y_mask): """ Args: x: batch * len1 * hdim y: batch * len2 * hdim y_mask: batch * len2 (1 for padding, 0 for true) Output: matched_seq: batch * len1 * hdim """ # Project vectors if self.linear: x_proj = self.linear(x.view(-1, x.size(2))).view(x.size()) x_proj = F.relu(x_proj) y_proj = self.linear(y.view(-1, y.size(2))).view(y.size()) y_proj = F.relu(y_proj) else: x_proj = x y_proj = y # Compute scores scores = x_proj.bmm(y_proj.transpose(2, 1)) # Mask padding y_mask = y_mask.unsqueeze(1).expand(scores.size()) scores.data.masked_fill_(y_mask.data, -float('inf')) # Normalize with softmax alpha = F.softmax(scores, dim=2) # Take weighted average matched_seq = alpha.bmm(y) return matched_seq class SelfAttnMatch(nn.Module): """Given sequences X and Y, match sequence Y to each element in X. 
* o_i = sum(alpha_j * x_j) for i in X * alpha_j = softmax(x_j * x_i) """ def __init__(self, input_size, identity=False, diag=True): super(SelfAttnMatch, self).__init__() if not identity: self.linear = nn.Linear(input_size, input_size) else: self.linear = None self.diag = diag def forward(self, x, x_mask): """ Args: x: batch * len1 * dim1 x_mask: batch * len1 (1 for padding, 0 for true) Output: matched_seq: batch * len1 * dim1 """ # Project vectors if self.linear: x_proj = self.linear(x.view(-1, x.size(2))).view(x.size()) x_proj = F.relu(x_proj) else: x_proj = x # Compute scores scores = x_proj.bmm(x_proj.transpose(2, 1)) if not self.diag: x_len = x.size(1) for i in range(x_len): scores[:, i, i] = 0 # Mask padding x_mask = x_mask.unsqueeze(1).expand(scores.size()) scores.data.masked_fill_(x_mask.data, -float('inf')) # Normalize with softmax alpha = F.softmax(scores, dim=2) # Take weighted average matched_seq = alpha.bmm(x) return matched_seq class BilinearSeqAttn(nn.Module): """A bilinear attention layer over a sequence X w.r.t y: * o_i = softmax(x_i'Wy) for x_i in X. Optionally don't normalize output weights. """ def __init__(self, x_size, y_size, identity=False, normalize=True): super(BilinearSeqAttn, self).__init__() self.normalize = normalize # If identity is true, we just use a dot product without transformation. 
if not identity: self.linear = nn.Linear(y_size, x_size) else: self.linear = None def forward(self, x, y, x_mask): """ Args: x: batch * len * hdim1 y: batch * hdim2 x_mask: batch * len (1 for padding, 0 for true) Output: alpha = batch * len """ Wy = self.linear(y) if self.linear is not None else y xWy = x.bmm(Wy.unsqueeze(2)).squeeze(2) xWy.data.masked_fill_(x_mask.data, -float('inf')) if self.normalize: if self.training: # In training we output log-softmax for NLL alpha = F.log_softmax(xWy) else: # ...Otherwise 0-1 probabilities alpha = F.softmax(xWy) else: alpha = xWy.exp() return alpha class LinearSeqAttn(nn.Module): """Self attention over a sequence: * o_i = softmax(Wx_i) for x_i in X. """ def __init__(self, input_size): super(LinearSeqAttn, self).__init__() self.linear = nn.Linear(input_size, 1) def forward(self, x, x_mask): """ Args: x: batch * len * hdim x_mask: batch * len (1 for padding, 0 for true) Output: alpha: batch * len """ x_flat = x.view(-1, x.size(-1)) scores = self.linear(x_flat).view(x.size(0), x.size(1)) scores.data.masked_fill_(x_mask.data, -float('inf')) alpha = F.softmax(scores) return alpha class NonLinearSeqAttn(nn.Module): """Self attention over a sequence: * o_i = softmax(function(Wx_i)) for x_i in X. 
""" def __init__(self, input_size, hidden_size): super(NonLinearSeqAttn, self).__init__() self.FFN = FeedForwardNetwork(input_size, hidden_size, 1) def forward(self, x, x_mask): """ Args: x: batch * len * dim x_mask: batch * len (1 for padding, 0 for true) Output: alpha: batch * len """ scores = self.FFN(x).squeeze(2) scores.data.masked_fill_(x_mask.data, -float('inf')) alpha = F.softmax(scores) return alpha # ------------------------------------------------------------------------------ # Functional Units # ------------------------------------------------------------------------------ class Gate(nn.Module): """Gate Unit g = sigmoid(Wx) x = g * x """ def __init__(self, input_size): super(Gate, self).__init__() self.linear = nn.Linear(input_size, input_size, bias=False) def forward(self, x): """ Args: x: batch * len * dim x_mask: batch * len (1 for padding, 0 for true) Output: res: batch * len * dim """ x_proj = self.linear(x) gate = F.sigmoid(x) return x_proj * gate class SFU(nn.Module): """Semantic Fusion Unit The ouput vector is expected to not only retrieve correlative information from fusion vectors, but also retain partly unchange as the input vector """ def __init__(self, input_size, fusion_size): super(SFU, self).__init__() self.linear_r = nn.Linear(input_size + fusion_size, input_size) self.linear_g = nn.Linear(input_size + fusion_size, input_size) def forward(self, x, fusions): r_f = torch.cat([x, fusions], 2) r = F.tanh(self.linear_r(r_f)) g = F.sigmoid(self.linear_g(r_f)) o = g * r + (1-g) * x return o # ------------------------------------------------------------------------------ # Functional # ------------------------------------------------------------------------------ def uniform_weights(x, x_mask): """Return uniform weights over non-masked x (a sequence of vectors). 
Args: x: batch * len * hdim x_mask: batch * len (1 for padding, 0 for true) Output: x_avg: batch * hdim """ alpha = Variable(torch.ones(x.size(0), x.size(1))) if x.data.is_cuda: alpha = alpha.cuda() alpha = alpha * x_mask.eq(0).float() alpha = alpha / alpha.sum(1).expand(alpha.size()) return alpha def weighted_avg(x, weights): """Return a weighted average of x (a sequence of vectors). Args: x: batch * len * hdim weights: batch * len, sum(dim = 1) = 1 Output: x_avg: batch * hdim """ return weights.unsqueeze(1).bmm(x).squeeze(1) ================================================ FILE: m_reader.py ================================================ #!/usr/bin/env python3 # Copyright 2018-present, HKUST-KnowComp. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. """Implementation of the Mnemonic Reader.""" import torch import torch.nn as nn import torch.nn.functional as F import layers from torch.autograd import Variable # ------------------------------------------------------------------------------ # Network # ------------------------------------------------------------------------------ class MnemonicReader(nn.Module): RNN_TYPES = {'lstm': nn.LSTM, 'gru': nn.GRU, 'rnn': nn.RNN} CELL_TYPES = {'lstm': nn.LSTMCell, 'gru': nn.GRUCell, 'rnn': nn.RNNCell} def __init__(self, args, normalize=True): super(MnemonicReader, self).__init__() # Store config self.args = args # Word embeddings (+1 for padding) self.embedding = nn.Embedding(args.vocab_size, args.embedding_dim, padding_idx=0) # Char embeddings (+1 for padding) self.char_embedding = nn.Embedding(args.char_size, args.char_embedding_dim, padding_idx=0) # Char rnn to generate char features self.char_rnn = layers.StackedBRNN( input_size=args.char_embedding_dim, hidden_size=args.char_hidden_size, num_layers=1, dropout_rate=args.dropout_rnn, dropout_output=args.dropout_rnn_output, concat_layers=False, 
rnn_type=self.RNN_TYPES[args.rnn_type], padding=False, ) doc_input_size = args.embedding_dim + args.char_hidden_size * 2 + args.num_features # Encoder self.encoding_rnn = layers.StackedBRNN( input_size=doc_input_size, hidden_size=args.hidden_size, num_layers=1, dropout_rate=args.dropout_rnn, dropout_output=args.dropout_rnn_output, concat_layers=False, rnn_type=self.RNN_TYPES[args.rnn_type], padding=args.rnn_padding, ) doc_hidden_size = 2 * args.hidden_size # Interactive aligning, self aligning and aggregating self.interactive_aligners = nn.ModuleList() self.interactive_SFUs = nn.ModuleList() self.self_aligners = nn.ModuleList() self.self_SFUs = nn.ModuleList() self.aggregate_rnns = nn.ModuleList() for i in range(args.hop): # interactive aligner self.interactive_aligners.append(layers.SeqAttnMatch(doc_hidden_size, identity=True)) self.interactive_SFUs.append(layers.SFU(doc_hidden_size, 3 * doc_hidden_size)) # self aligner self.self_aligners.append(layers.SelfAttnMatch(doc_hidden_size, identity=True, diag=False)) self.self_SFUs.append(layers.SFU(doc_hidden_size, 3 * doc_hidden_size)) # aggregating self.aggregate_rnns.append( layers.StackedBRNN( input_size=doc_hidden_size, hidden_size=args.hidden_size, num_layers=1, dropout_rate=args.dropout_rnn, dropout_output=args.dropout_rnn_output, concat_layers=False, rnn_type=self.RNN_TYPES[args.rnn_type], padding=args.rnn_padding, ) ) # Memmory-based Answer Pointer self.mem_ans_ptr = layers.MemoryAnsPointer( x_size=2*args.hidden_size, y_size=2*args.hidden_size, hidden_size=args.hidden_size, hop=args.hop, dropout_rate=args.dropout_rnn, normalize=normalize ) def forward(self, x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask): """Inputs: x1 = document word indices [batch * len_d] x1_c = document char indices [batch * len_d] x1_f = document word features indices [batch * len_d * nfeat] x1_mask = document padding mask [batch * len_d] x2 = question word indices [batch * len_q] x2_c = document char indices [batch * len_d] x1_f = 
document word features indices [batch * len_d * nfeat] x2_mask = question padding mask [batch * len_q] """ # Embed both document and question x1_emb = self.embedding(x1) x2_emb = self.embedding(x2) x1_c_emb = self.char_embedding(x1_c) x2_c_emb = self.char_embedding(x2_c) # Dropout on embeddings if self.args.dropout_emb > 0: x1_emb = F.dropout(x1_emb, p=self.args.dropout_emb, training=self.training) x2_emb = F.dropout(x2_emb, p=self.args.dropout_emb, training=self.training) x1_c_emb = F.dropout(x1_c_emb, p=self.args.dropout_emb, training=self.training) x2_c_emb = F.dropout(x2_c_emb, p=self.args.dropout_emb, training=self.training) # Generate char features x1_c_features = self.char_rnn( x1_c_emb.reshape((x1_c_emb.size(0) * x1_c_emb.size(1), x1_c_emb.size(2), x1_c_emb.size(3))), x1_mask.unsqueeze(2).repeat(1, 1, x1_c_emb.size(2)).reshape((x1_c_emb.size(0) * x1_c_emb.size(1), x1_c_emb.size(2))) ).reshape((x1_c_emb.size(0), x1_c_emb.size(1), x1_c_emb.size(2), -1))[:,:,-1,:] x2_c_features = self.char_rnn( x2_c_emb.reshape((x2_c_emb.size(0) * x2_c_emb.size(1), x2_c_emb.size(2), x2_c_emb.size(3))), x2_mask.unsqueeze(2).repeat(1, 1, x2_c_emb.size(2)).reshape((x2_c_emb.size(0) * x2_c_emb.size(1), x2_c_emb.size(2))) ).reshape((x2_c_emb.size(0), x2_c_emb.size(1), x2_c_emb.size(2), -1))[:,:,-1,:] # Combine input crnn_input = [x1_emb, x1_c_features] qrnn_input = [x2_emb, x2_c_features] # Add manual features if self.args.num_features > 0: crnn_input.append(x1_f) qrnn_input.append(x2_f) # Encode document with RNN c = self.encoding_rnn(torch.cat(crnn_input, 2), x1_mask) # Encode question with RNN q = self.encoding_rnn(torch.cat(qrnn_input, 2), x2_mask) # Align and aggregate c_check = c for i in range(self.args.hop): q_tilde = self.interactive_aligners[i].forward(c_check, q, x2_mask) c_bar = self.interactive_SFUs[i].forward(c_check, torch.cat([q_tilde, c_check * q_tilde, c_check - q_tilde], 2)) c_tilde = self.self_aligners[i].forward(c_bar, x1_mask) c_hat = 
self.self_SFUs[i].forward(c_bar, torch.cat([c_tilde, c_bar * c_tilde, c_bar - c_tilde], 2)) c_check = self.aggregate_rnns[i].forward(c_hat, x1_mask) # Predict start_scores, end_scores = self.mem_ans_ptr.forward(c_check, q, x1_mask, x2_mask) return start_scores, end_scores ================================================ FILE: model.py ================================================ #!/usr/bin/env python3 # Copyright 2018-present, HKUST-KnowComp. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. """Document Reader model""" import torch import torch.optim as optim import torch.nn.functional as F import numpy as np import logging import copy from torch.autograd import Variable from config import override_model_args from r_net import R_Net from rnn_reader import RnnDocReader from m_reader import MnemonicReader from data import Dictionary logger = logging.getLogger(__name__) class DocReader(object): """High level model that handles intializing the underlying network architecture, saving, updating examples, and predicting examples. """ # -------------------------------------------------------------------------- # Initialization # -------------------------------------------------------------------------- def __init__(self, args, word_dict, char_dict, feature_dict, state_dict=None, normalize=True): # Book-keeping. self.args = args self.word_dict = word_dict self.char_dict = char_dict self.args.vocab_size = len(word_dict) self.args.char_size = len(char_dict) self.feature_dict = feature_dict self.args.num_features = len(feature_dict) self.updates = 0 self.use_cuda = False self.parallel = False # Building network. If normalize if false, scores are not normalized # 0-1 per paragraph (no softmax). 
if args.model_type == 'rnn': self.network = RnnDocReader(args, normalize) elif args.model_type == 'r_net': self.network = R_Net(args, normalize) elif args.model_type == 'mnemonic': self.network = MnemonicReader(args, normalize) else: raise RuntimeError('Unsupported model: %s' % args.model_type) # Load saved state if state_dict: # Load buffer separately if 'fixed_embedding' in state_dict: fixed_embedding = state_dict.pop('fixed_embedding') self.network.load_state_dict(state_dict) self.network.register_buffer('fixed_embedding', fixed_embedding) else: self.network.load_state_dict(state_dict) def expand_dictionary(self, words): """Add words to the DocReader dictionary if they do not exist. The underlying embedding matrix is also expanded (with random embeddings). Args: words: iterable of tokens to add to the dictionary. Output: added: set of tokens that were added. """ to_add = {self.word_dict.normalize(w) for w in words if w not in self.word_dict} # Add words to dictionary and expand embedding layer if len(to_add) > 0: logger.info('Adding %d new words to dictionary...' % len(to_add)) for w in to_add: self.word_dict.add(w) self.args.vocab_size = len(self.word_dict) logger.info('New vocab size: %d' % len(self.word_dict)) old_embedding = self.network.embedding.weight.data self.network.embedding = torch.nn.Embedding(self.args.vocab_size, self.args.embedding_dim, padding_idx=0) new_embedding = self.network.embedding.weight.data new_embedding[:old_embedding.size(0)] = old_embedding # Return added words return to_add def expand_char_dictionary(self, chars): """Add chars to the DocReader dictionary if they do not exist. The underlying embedding matrix is also expanded (with random embeddings). Args: chars: iterable of tokens to add to the dictionary. Output: added: set of tokens that were added. 
""" to_add = {self.char_dict.normalize(w) for w in chars if w not in self.char_dict} # Add chars to dictionary and expand embedding layer if len(to_add) > 0: logger.info('Adding %d new chars to dictionary...' % len(to_add)) for w in to_add: self.char_dict.add(w) self.args.char_size = len(self.char_dict) logger.info('New char size: %d' % len(self.char_dict)) old_char_embedding = self.network.char_embedding.weight.data self.network.char_embedding = torch.nn.Embedding(self.args.char_size, self.args.char_embedding_dim, padding_idx=0) new_char_embedding = self.network.char_embedding.weight.data new_char_embedding[:old_char_embedding.size(0)] = old_char_embedding # Return added chars return to_add def load_embeddings(self, words, embedding_file): """Load pretrained embeddings for a given list of words, if they exist. Args: words: iterable of tokens. Only those that are indexed in the dictionary are kept. embedding_file: path to text file of embeddings, space separated. """ words = {w for w in words if w in self.word_dict} logger.info('Loading pre-trained embeddings for %d words from %s' % (len(words), embedding_file)) embedding = self.network.embedding.weight.data # When normalized, some words are duplicated. (Average the embeddings). 
vec_counts = {} with open(embedding_file) as f: for line in f: parsed = line.rstrip().split(' ') assert(len(parsed) == embedding.size(1) + 1) w = self.word_dict.normalize(parsed[0]) if w in words: vec = torch.Tensor([float(i) for i in parsed[1:]]) if w not in vec_counts: vec_counts[w] = 1 embedding[self.word_dict[w]].copy_(vec) else: logging.warning( 'WARN: Duplicate embedding found for %s' % w ) vec_counts[w] = vec_counts[w] + 1 embedding[self.word_dict[w]].add_(vec) for w, c in vec_counts.items(): embedding[self.word_dict[w]].div_(c) logger.info('Loaded %d embeddings (%.2f%%)' % (len(vec_counts), 100 * len(vec_counts) / len(words))) def load_char_embeddings(self, chars, char_embedding_file): """Load pretrained embeddings for a given list of chars, if they exist. Args: chars: iterable of tokens. Only those that are indexed in the dictionary are kept. char_embedding_file: path to text file of embeddings, space separated. """ chars = {w for w in chars if w in self.char_dict} logger.info('Loading pre-trained embeddings for %d chars from %s' % (len(chars), char_embedding_file)) char_embedding = self.network.char_embedding.weight.data # When normalized, some chars are duplicated. (Average the embeddings). vec_counts = {} with open(char_embedding_file) as f: for line in f: parsed = line.rstrip().split(' ') assert(len(parsed) == char_embedding.size(1) + 1) w = self.char_dict.normalize(parsed[0]) if w in chars: vec = torch.Tensor([float(i) for i in parsed[1:]]) if w not in vec_counts: vec_counts[w] = 1 char_embedding[self.char_dict[w]].copy_(vec) else: logging.warning( 'WARN: Duplicate char embedding found for %s' % w ) vec_counts[w] = vec_counts[w] + 1 char_embedding[self.char_dict[w]].add_(vec) for w, c in vec_counts.items(): char_embedding[self.char_dict[w]].div_(c) logger.info('Loaded %d char embeddings (%.2f%%)' % (len(vec_counts), 100 * len(vec_counts) / len(chars))) def tune_embeddings(self, words): """Unfix the embeddings of a list of words. 
This is only relevant if only some of the embeddings are being tuned (tune_partial = N). Shuffles the N specified words to the front of the dictionary, and saves the original vectors of the other N + 1:vocab words in a fixed buffer. Args: words: iterable of tokens contained in dictionary. """ words = {w for w in words if w in self.word_dict} if len(words) == 0: logger.warning('Tried to tune embeddings, but no words given!') return if len(words) == len(self.word_dict): logger.warning('Tuning ALL embeddings in dictionary') return # Shuffle words and vectors embedding = self.network.embedding.weight.data for idx, swap_word in enumerate(words, self.word_dict.START): # Get current word + embedding for this index curr_word = self.word_dict[idx] curr_emb = embedding[idx].clone() old_idx = self.word_dict[swap_word] # Swap embeddings + dictionary indices embedding[idx].copy_(embedding[old_idx]) embedding[old_idx].copy_(curr_emb) self.word_dict[swap_word] = idx self.word_dict[idx] = swap_word self.word_dict[curr_word] = old_idx self.word_dict[old_idx] = curr_word # Save the original, fixed embeddings self.network.register_buffer( 'fixed_embedding', embedding[idx + 1:].clone() ) def init_optimizer(self, state_dict=None): """Initialize an optimizer for the free parameters of the network. 
        Args:
            state_dict: optimizer state as saved by `checkpoint` (accepted
              for `load_checkpoint`, but not restored here)
        """
        if self.args.fix_embeddings:
            for p in self.network.embedding.parameters():
                p.requires_grad = False
        parameters = [p for p in self.network.parameters() if p.requires_grad]
        if self.args.optimizer == 'sgd':
            self.optimizer = optim.SGD(parameters, lr=self.args.learning_rate,
                                       momentum=self.args.momentum,
                                       weight_decay=self.args.weight_decay)
        elif self.args.optimizer == 'adamax':
            self.optimizer = optim.Adamax(parameters,
                                          weight_decay=self.args.weight_decay)
        elif self.args.optimizer == 'adadelta':
            self.optimizer = optim.Adadelta(parameters, lr=self.args.learning_rate,
                                            rho=self.args.rho, eps=self.args.eps,
                                            weight_decay=self.args.weight_decay)
        else:
            raise RuntimeError('Unsupported optimizer: %s' %
                               self.args.optimizer)

    # --------------------------------------------------------------------------
    # Learning
    # --------------------------------------------------------------------------

    def update(self, ex):
        """Forward a batch of examples; step the optimizer to update weights."""
        if not self.optimizer:
            raise RuntimeError('No optimizer set.')

        # Train mode
        self.network.train()

        # Transfer to GPU (`async` is a reserved word since Python 3.7;
        # `non_blocking` is the equivalent keyword argument)
        if self.use_cuda:
            inputs = [e if e is None else Variable(e.cuda(non_blocking=True))
                      for e in ex[:-3]]
            target_s = Variable(ex[-3].cuda(non_blocking=True))
            target_e = Variable(ex[-2].cuda(non_blocking=True))
        else:
            inputs = [e if e is None else Variable(e) for e in ex[:-3]]
            target_s = Variable(ex[-3])
            target_e = Variable(ex[-2])

        # Run forward
        score_s, score_e = self.network(*inputs)

        # Compute loss and accuracies
        loss = F.nll_loss(score_s, target_s) + F.nll_loss(score_e, target_e)

        # Clear gradients and run backward
        self.optimizer.zero_grad()
        loss.backward()

        # Clip gradients
        torch.nn.utils.clip_grad_norm_(self.network.parameters(),
                                       self.args.grad_clipping)

        # Update parameters
        self.optimizer.step()
        self.updates += 1

        # Reset any partially fixed parameters (e.g. rare words)
        self.reset_parameters()

        # `loss` is a zero-dimensional tensor; `.item()` returns its Python value
        return loss.item(), ex[0].size(0)

    def reset_parameters(self):
        """Reset any partially fixed parameters to original states."""

        # Reset fixed embeddings to original value
        if self.args.tune_partial > 0:
            # Embeddings to fix are indexed after the special + N tuned words
            offset = self.args.tune_partial + self.word_dict.START
            if self.parallel:
                embedding = self.network.module.embedding.weight.data
                fixed_embedding = self.network.module.fixed_embedding
            else:
                embedding = self.network.embedding.weight.data
                fixed_embedding = self.network.fixed_embedding
            if offset < embedding.size(0):
                embedding[offset:] = fixed_embedding

    # --------------------------------------------------------------------------
    # Prediction
    # --------------------------------------------------------------------------

    def predict(self, ex, candidates=None, top_n=1, async_pool=None):
        """Forward a batch of examples only to get predictions.

        Args:
            ex: the batch
            candidates: batch * variable length list of string answer options.
              The model will only consider exact spans contained in this list.
            top_n: Number of predictions to return per batch element.
            async_pool: If provided, non-gpu post-processing will be offloaded
              to this CPU process pool.
        Output:
            pred_s: batch * top_n predicted start indices
            pred_e: batch * top_n predicted end indices
            pred_score: batch * top_n prediction scores

        If async_pool is given, these will be AsyncResult handles.
""" # Eval mode self.network.eval() # Transfer to GPU if self.use_cuda: inputs = [e if e is None else Variable(e.cuda(async=True), volatile=True) for e in ex[:8]] else: inputs = [e if e is None else Variable(e, volatile=True) for e in ex[:8]] # Run forward score_s, score_e = self.network(*inputs) del inputs # Decode predictions score_s = score_s.data.cpu() score_e = score_e.data.cpu() if candidates: args = (score_s, score_e, candidates, top_n, self.args.max_len) if async_pool: return async_pool.apply_async(self.decode_candidates, args) else: return self.decode_candidates(*args) else: args = (score_s, score_e, top_n, self.args.max_len) if async_pool: return async_pool.apply_async(self.decode, args) else: return self.decode(*args) @staticmethod def decode(score_s, score_e, top_n=1, max_len=None): """Take argmax of constrained score_s * score_e. Args: score_s: independent start predictions score_e: independent end predictions top_n: number of top scored pairs to take max_len: max span length to consider """ pred_s = [] pred_e = [] pred_score = [] max_len = max_len or score_s.size(1) for i in range(score_s.size(0)): # Outer product of scores to get full p_s * p_e matrix scores = torch.ger(score_s[i], score_e[i]) # Zero out negative length and over-length span scores scores.triu_().tril_(max_len - 1) # Take argmax or top n scores = scores.numpy() scores_flat = scores.flatten() if top_n == 1: idx_sort = [np.argmax(scores_flat)] elif len(scores_flat) < top_n: idx_sort = np.argsort(-scores_flat) else: idx = np.argpartition(-scores_flat, top_n)[0:top_n] idx_sort = idx[np.argsort(-scores_flat[idx])] s_idx, e_idx = np.unravel_index(idx_sort, scores.shape) pred_s.append(s_idx) pred_e.append(e_idx) pred_score.append(scores_flat[idx_sort]) del score_s, score_e return pred_s, pred_e, pred_score @staticmethod def decode_candidates(score_s, score_e, candidates, top_n=1, max_len=None): """Take argmax of constrained score_s * score_e. 
Except only consider spans that are in the candidates list. """ pred_s = [] pred_e = [] pred_score = [] for i in range(score_s.size(0)): # Extract original tokens stored with candidates tokens = candidates[i]['input'] cands = candidates[i]['cands'] if not cands: # try getting from globals? (multiprocessing in pipeline mode) from ..pipeline.wrmcqa import PROCESS_CANDS cands = PROCESS_CANDS if not cands: raise RuntimeError('No candidates given.') # Score all valid candidates found in text. # Brute force get all ngrams and compare against the candidate list. max_len = max_len or len(tokens) scores, s_idx, e_idx = [], [], [] for s, e in tokens.ngrams(n=max_len, as_strings=False): span = tokens.slice(s, e).untokenize() if span in cands or span.lower() in cands: # Match! Record its score. scores.append(score_s[i][s] * score_e[i][e - 1]) s_idx.append(s) e_idx.append(e - 1) if len(scores) == 0: # No candidates present pred_s.append([]) pred_e.append([]) pred_score.append([]) else: # Rank found candidates scores = np.array(scores) s_idx = np.array(s_idx) e_idx = np.array(e_idx) idx_sort = np.argsort(-scores)[0:top_n] pred_s.append(s_idx[idx_sort]) pred_e.append(e_idx[idx_sort]) pred_score.append(scores[idx_sort]) del score_s, score_e, candidates return pred_s, pred_e, pred_score # -------------------------------------------------------------------------- # Saving and loading # -------------------------------------------------------------------------- def save(self, filename): state_dict = copy.copy(self.network.state_dict()) if 'fixed_embedding' in state_dict: state_dict.pop('fixed_embedding') params = { 'state_dict': state_dict, 'word_dict': self.word_dict, 'char_dict': self.char_dict, 'feature_dict': self.feature_dict, 'args': self.args, } try: torch.save(params, filename) except BaseException: logger.warning('WARN: Saving failed... 
continuing anyway.') def checkpoint(self, filename, epoch): params = { 'state_dict': self.network.state_dict(), 'word_dict': self.word_dict, 'char_dict': self.char_dict, 'feature_dict': self.feature_dict, 'args': self.args, 'epoch': epoch, 'optimizer': self.optimizer.state_dict(), } try: torch.save(params, filename) except BaseException: logger.warning('WARN: Saving failed... continuing anyway.') @staticmethod def load(filename, new_args=None, normalize=True): logger.info('Loading model %s' % filename) saved_params = torch.load( filename, map_location=lambda storage, loc: storage ) word_dict = saved_params['word_dict'] try: char_dict = saved_params['char_dict'] except KeyError as e: char_dict = Dictionary() feature_dict = saved_params['feature_dict'] state_dict = saved_params['state_dict'] args = saved_params['args'] if new_args: args = override_model_args(args, new_args) return DocReader(args, word_dict, char_dict, feature_dict, state_dict, normalize) @staticmethod def load_checkpoint(filename, normalize=True): logger.info('Loading model %s' % filename) saved_params = torch.load( filename, map_location=lambda storage, loc: storage ) word_dict = saved_params['word_dict'] char_dict = saved_params['char_dict'] feature_dict = saved_params['feature_dict'] state_dict = saved_params['state_dict'] epoch = saved_params['epoch'] optimizer = saved_params['optimizer'] args = saved_params['args'] model = DocReader(args, word_dict, char_dict, feature_dict, state_dict, normalize) model.init_optimizer(optimizer) return model, epoch # -------------------------------------------------------------------------- # Runtime # -------------------------------------------------------------------------- def cuda(self): self.use_cuda = True self.network = self.network.cuda() def cpu(self): self.use_cuda = False self.network = self.network.cpu() def parallelize(self): """Use data parallel to copy the model across several gpus. This will take all gpus visible with CUDA_VISIBLE_DEVICES. 
""" self.parallel = True self.network = torch.nn.DataParallel(self.network) ================================================ FILE: predictor.py ================================================ #!/usr/bin/env python3 # Copyright 2018-present, HKUST-KnowComp. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. """Machine Comprehension predictor""" import logging from multiprocessing import Pool as ProcessPool from multiprocessing.util import Finalize from vector import vectorize, batchify from model import DocReader import utils from spacy_tokenizer import SpacyTokenizer logger = logging.getLogger(__name__) # ------------------------------------------------------------------------------ # Tokenize + annotate # ------------------------------------------------------------------------------ TOK = None def init(options): global TOK TOK = SpacyTokenizer(**options) Finalize(TOK, TOK.shutdown, exitpriority=100) def tokenize(text): global TOK return TOK.tokenize(text) def get_annotators_for_model(model): annotators = set() if model.args.use_pos: annotators.add('pos') if model.args.use_lemma: annotators.add('lemma') if model.args.use_ner: annotators.add('ner') return annotators # ------------------------------------------------------------------------------ # Predictor class. # ------------------------------------------------------------------------------ class Predictor(object): """Load a pretrained DocReader model and predict inputs on the fly.""" def __init__(self, model, normalize=True, embedding_file=None, char_embedding_file=None, num_workers=None): """ Args: model: path to saved model file. normalize: squash output score to 0-1 probabilities with a softmax. embedding_file: if provided, will expand dictionary to use all available pretrained vectors in this file. num_workers: number of CPU processes to use to preprocess batches. 
""" logger.info('Initializing model...') self.model = DocReader.load(model, normalize=normalize) if embedding_file: logger.info('Expanding dictionary...') utils.index_embedding_words(embedding_file) added_words = self.model.expand_dictionary(words) self.model.load_embeddings(added_words, embedding_file) if char_embedding_file: logger.info('Expanding dictionary...') chars = utils.index_embedding_chars(char_embedding_file) added_chars = self.model.expand_char_dictionary(chars) self.model.load_char_embeddings(added_chars, char_embedding_file) logger.info('Initializing tokenizer...') annotators = get_annotators_for_model(self.model) if num_workers is None or num_workers > 0: self.workers = ProcessPool( num_workers, initializer=init, initargs=({'annotators': annotators},), ) else: self.workers = None self.tokenizer = SpacyTokenizer(annotators=annotators) def predict(self, document, question, candidates=None, top_n=1): """Predict a single document - question pair.""" results = self.predict_batch([(document, question, candidates,)], top_n) return results[0] def predict_batch(self, batch, top_n=1): """Predict a batch of document - question pairs.""" documents, questions, candidates = [], [], [] for b in batch: documents.append(b[0]) questions.append(b[1]) candidates.append(b[2] if len(b) == 3 else None) candidates = candidates if any(candidates) else None # Tokenize the inputs, perhaps multi-processed. 
if self.workers: q_tokens = self.workers.map_async(tokenize, questions) c_tokens = self.workers.map_async(tokenize, documents) q_tokens = list(q_tokens.get()) c_tokens = list(c_tokens.get()) else: q_tokens = list(map(self.tokenizer.tokenize, questions)) c_tokens = list(map(self.tokenizer.tokenize, documents)) examples = [] for i in range(len(questions)): examples.append({ 'id': i, 'question': q_tokens[i].words(), 'question_char': q_tokens[i].chars(), 'qlemma': q_tokens[i].lemmas(), 'qpos': q_tokens[i].pos(), 'qner': q_tokens[i].entities(), 'document': c_tokens[i].words(), 'document_char': c_tokens[i].chars(), 'clemma': c_tokens[i].lemmas(), 'cpos': c_tokens[i].pos(), 'cner': c_tokens[i].entities(), }) # Stick document tokens in candidates for decoding if candidates: candidates = [{'input': c_tokens[i], 'cands': candidates[i]} for i in range(len(candidates))] # Build the batch and run it through the model batch_exs = batchify([vectorize(e, self.model) for e in examples]) s, e, score = self.model.predict(batch_exs, candidates, top_n) # Retrieve the predicted spans results = [] for i in range(len(s)): predictions = [] for j in range(len(s[i])): span = c_tokens[i].slice(s[i][j], e[i][j] + 1).untokenize() predictions.append((span, score[i][j])) results.append(predictions) return results def cuda(self): self.model.cuda() def cpu(self): self.model.cpu() ================================================ FILE: r_net.py ================================================ #!/usr/bin/env python3 # Copyright 2018-present, HKUST-KnowComp. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. 
"""Implementation of the R-Net based reader.""" import torch import torch.nn as nn import torch.nn.functional as F import layers from torch.autograd import Variable # ------------------------------------------------------------------------------ # Network # ------------------------------------------------------------------------------ class R_Net(nn.Module): RNN_TYPES = {'lstm': nn.LSTM, 'gru': nn.GRU, 'rnn': nn.RNN} CELL_TYPES = {'lstm': nn.LSTMCell, 'gru': nn.GRUCell, 'rnn': nn.RNNCell} def __init__(self, args, normalize=True): super(R_Net, self).__init__() # Store config self.args = args # Word embeddings (+1 for padding) self.embedding = nn.Embedding(args.vocab_size, args.embedding_dim, padding_idx=0) # Char embeddings (+1 for padding) self.char_embedding = nn.Embedding(args.char_size, args.char_embedding_dim, padding_idx=0) # Char rnn to generate char features self.char_rnn = layers.StackedBRNN( input_size=args.char_embedding_dim, hidden_size=args.char_hidden_size, num_layers=1, dropout_rate=args.dropout_rnn, dropout_output=args.dropout_rnn_output, concat_layers=False, rnn_type=self.RNN_TYPES[args.rnn_type], padding=False, ) doc_input_size = args.embedding_dim + args.char_hidden_size * 2 # Encoder self.encode_rnn = layers.StackedBRNN( input_size=doc_input_size, hidden_size=args.hidden_size, num_layers=args.doc_layers, dropout_rate=args.dropout_rnn, dropout_output=args.dropout_rnn_output, concat_layers=args.concat_rnn_layers, rnn_type=self.RNN_TYPES[args.rnn_type], padding=args.rnn_padding, ) # Output sizes of rnn encoder doc_hidden_size = 2 * args.hidden_size question_hidden_size = 2 * args.hidden_size if args.concat_rnn_layers: doc_hidden_size *= args.doc_layers question_hidden_size *= args.question_layers # Gated-attention-based RNN of the whole question self.question_attn = layers.SeqAttnMatch(question_hidden_size, identity=False) self.question_attn_gate = layers.Gate(doc_hidden_size + question_hidden_size) self.question_attn_rnn = layers.StackedBRNN( 
            input_size=doc_hidden_size + question_hidden_size,
            hidden_size=args.hidden_size,
            num_layers=1,
            dropout_rate=args.dropout_rnn,
            dropout_output=args.dropout_rnn_output,
            concat_layers=False,
            rnn_type=self.RNN_TYPES[args.rnn_type],
            padding=args.rnn_padding,
        )
        question_attn_hidden_size = 2 * args.hidden_size

        # Self-matching-attention-based RNN of the whole doc
        self.doc_self_attn = layers.SelfAttnMatch(question_attn_hidden_size, identity=False)
        self.doc_self_attn_gate = layers.Gate(question_attn_hidden_size + question_attn_hidden_size)
        self.doc_self_attn_rnn = layers.StackedBRNN(
            input_size=question_attn_hidden_size + question_attn_hidden_size,
            hidden_size=args.hidden_size,
            num_layers=1,
            dropout_rate=args.dropout_rnn,
            dropout_output=args.dropout_rnn_output,
            concat_layers=False,
            rnn_type=self.RNN_TYPES[args.rnn_type],
            padding=args.rnn_padding,
        )
        doc_self_attn_hidden_size = 2 * args.hidden_size

        self.doc_self_attn_rnn2 = layers.StackedBRNN(
            input_size=doc_self_attn_hidden_size,
            hidden_size=args.hidden_size,
            num_layers=1,
            dropout_rate=args.dropout_rnn,
            dropout_output=args.dropout_rnn_output,
            concat_layers=False,
            rnn_type=self.RNN_TYPES[args.rnn_type],
            padding=args.rnn_padding,
        )

        self.ptr_net = layers.PointerNetwork(
            x_size=doc_self_attn_hidden_size,
            y_size=question_hidden_size,
            hidden_size=args.hidden_size,
            dropout_rate=args.dropout_rnn,
            cell_type=nn.GRUCell,
            normalize=normalize
        )

    def forward(self, x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask):
        """Inputs:
        x1 = document word indices             [batch * len_d]
        x1_c = document char indices           [batch * len_d * word_len]
        x1_f = document word features indices  [batch * len_d * nfeat]
        x1_mask = document padding mask        [batch * len_d]
        x2 = question word indices             [batch * len_q]
        x2_c = question char indices           [batch * len_q * word_len]
        x2_f = question word features indices  [batch * len_q * nfeat]
        x2_mask = question padding mask        [batch * len_q]
        """
        # Embed both document and question
        x1_emb = self.embedding(x1)
        x2_emb = self.embedding(x2)
        x1_c_emb =
self.char_embedding(x1_c) x2_c_emb = self.char_embedding(x2_c) # Dropout on embeddings if self.args.dropout_emb > 0: x1_emb = F.dropout(x1_emb, p=self.args.dropout_emb, training=self.training) x2_emb = F.dropout(x2_emb, p=self.args.dropout_emb, training=self.training) x1_c_emb = F.dropout(x1_c_emb, p=self.args.dropout_emb, training=self.training) x2_c_emb = F.dropout(x2_c_emb, p=self.args.dropout_emb, training=self.training) # Generate char features x1_c_features = self.char_rnn( x1_c_emb.reshape((x1_c_emb.size(0) * x1_c_emb.size(1), x1_c_emb.size(2), x1_c_emb.size(3))), x1_mask.unsqueeze(2).repeat(1, 1, x1_c_emb.size(2)).reshape((x1_c_emb.size(0) * x1_c_emb.size(1), x1_c_emb.size(2))) ).reshape((x1_c_emb.size(0), x1_c_emb.size(1), x1_c_emb.size(2), -1))[:,:,-1,:] x2_c_features = self.char_rnn( x2_c_emb.reshape((x2_c_emb.size(0) * x2_c_emb.size(1), x2_c_emb.size(2), x2_c_emb.size(3))), x2_mask.unsqueeze(2).repeat(1, 1, x2_c_emb.size(2)).reshape((x2_c_emb.size(0) * x2_c_emb.size(1), x2_c_emb.size(2))) ).reshape((x2_c_emb.size(0), x2_c_emb.size(1), x2_c_emb.size(2), -1))[:,:,-1,:] # Combine input crnn_input = [x1_emb, x1_c_features] qrnn_input = [x2_emb, x2_c_features] # Encode document with RNN c = self.encode_rnn(torch.cat(crnn_input, 2), x1_mask) # Encode question with RNN q = self.encode_rnn(torch.cat(qrnn_input, 2), x2_mask) # Match questions to docs question_attn_hiddens = self.question_attn(c, q, x2_mask) rnn_input = self.question_attn_gate(torch.cat([c, question_attn_hiddens], 2)) c = self.question_attn_rnn(rnn_input, x1_mask) # Match documents to themselves doc_self_attn_hiddens = self.doc_self_attn(c, x1_mask) rnn_input = self.doc_self_attn_gate(torch.cat([c, doc_self_attn_hiddens], 2)) c = self.doc_self_attn_rnn(rnn_input, x1_mask) c = self.doc_self_attn_rnn2(c, x1_mask) # Predict start_scores, end_scores = self.ptr_net(c, q, x1_mask, x2_mask) return start_scores, end_scores ================================================ FILE: rnn_reader.py 
================================================ #!/usr/bin/env python3 # Copyright 2017-present, Facebook, Inc. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. """Implementation of the RNN based DrQA reader.""" import torch import torch.nn as nn import layers # ------------------------------------------------------------------------------ # Network # ------------------------------------------------------------------------------ class RnnDocReader(nn.Module): RNN_TYPES = {'lstm': nn.LSTM, 'gru': nn.GRU, 'rnn': nn.RNN} CELL_TYPES = {'lstm': nn.LSTMCell, 'gru': nn.GRUCell, 'rnn': nn.RNNCell} def __init__(self, args, normalize=True): super(RnnDocReader, self).__init__() # Store config self.args = args # Word embeddings (+1 for padding) self.embedding = nn.Embedding(args.vocab_size, args.embedding_dim, padding_idx=0) # Projection for attention weighted question if args.use_qemb: self.qemb_match = layers.SeqAttnMatch(args.embedding_dim) # Input size to RNN: word emb + question emb + manual features doc_input_size = args.embedding_dim + args.num_features if args.use_qemb: doc_input_size += args.embedding_dim # RNN document encoder self.doc_rnn = layers.StackedBRNN( input_size=doc_input_size, hidden_size=args.hidden_size, num_layers=args.doc_layers, dropout_rate=args.dropout_rnn, dropout_output=args.dropout_rnn_output, concat_layers=args.concat_rnn_layers, rnn_type=self.RNN_TYPES[args.rnn_type], padding=args.rnn_padding, ) # RNN question encoder self.question_rnn = layers.StackedBRNN( input_size=args.embedding_dim, hidden_size=args.hidden_size, num_layers=args.question_layers, dropout_rate=args.dropout_rnn, dropout_output=args.dropout_rnn_output, concat_layers=args.concat_rnn_layers, rnn_type=self.RNN_TYPES[args.rnn_type], padding=args.rnn_padding, ) # Output sizes of rnn encoders doc_hidden_size = 2 * args.hidden_size question_hidden_size = 2 * args.hidden_size if 
args.concat_rnn_layers: doc_hidden_size *= args.doc_layers question_hidden_size *= args.question_layers # Question merging if args.question_merge not in ['avg', 'self_attn']: raise NotImplementedError('question_merge = %s' % args.question_merge) if args.question_merge == 'self_attn': self.self_attn = layers.LinearSeqAttn(question_hidden_size) # Bilinear attention for span start/end self.start_attn = layers.BilinearSeqAttn( doc_hidden_size, question_hidden_size, normalize=normalize, ) self.end_attn = layers.BilinearSeqAttn( doc_hidden_size, question_hidden_size, normalize=normalize, ) def forward(self, x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask): """Inputs: x1 = document word indices [batch * len_d] x1_f = document word features indices [batch * len_d * nfeat] x1_mask = document padding mask [batch * len_d] x2 = question word indices [batch * len_q] x2_mask = question padding mask [batch * len_q] """ # Embed both document and question x1_emb = self.embedding(x1) x2_emb = self.embedding(x2) # Dropout on embeddings if self.args.dropout_emb > 0: x1_emb = nn.functional.dropout(x1_emb, p=self.args.dropout_emb, training=self.training) x2_emb = nn.functional.dropout(x2_emb, p=self.args.dropout_emb, training=self.training) # Form document encoding inputs drnn_input = [x1_emb] # Add attention-weighted question representation if self.args.use_qemb: x2_weighted_emb = self.qemb_match(x1_emb, x2_emb, x2_mask) drnn_input.append(x2_weighted_emb) # Add manual features if self.args.num_features > 0: drnn_input.append(x1_f) # Encode document with RNN doc_hiddens = self.doc_rnn(torch.cat(drnn_input, 2), x1_mask) # Encode question with RNN + merge hiddens question_hiddens = self.question_rnn(x2_emb, x2_mask) if self.args.question_merge == 'avg': q_merge_weights = layers.uniform_weights(question_hiddens, x2_mask) elif self.args.question_merge == 'self_attn': q_merge_weights = self.self_attn(question_hiddens, x2_mask) question_hidden = layers.weighted_avg(question_hiddens, q_merge_weights) 
# Predict start and end positions start_scores = self.start_attn(doc_hiddens, question_hidden, x1_mask) end_scores = self.end_attn(doc_hiddens, question_hidden, x1_mask) return start_scores, end_scores ================================================ FILE: script/evaluate-v1.1.py ================================================ """ Official evaluation script for v1.1 of the SQuAD dataset. """ from __future__ import print_function from collections import Counter import string import re import argparse import json import sys def normalize_answer(s): """Lower text and remove punctuation, articles and extra whitespace.""" def remove_articles(text): return re.sub(r'\b(a|an|the)\b', ' ', text) def white_space_fix(text): return ' '.join(text.split()) def remove_punc(text): exclude = set(string.punctuation) return ''.join(ch for ch in text if ch not in exclude) def lower(text): return text.lower() return white_space_fix(remove_articles(remove_punc(lower(s)))) def f1_score(prediction, ground_truth): prediction_tokens = normalize_answer(prediction).split() ground_truth_tokens = normalize_answer(ground_truth).split() common = Counter(prediction_tokens) & Counter(ground_truth_tokens) num_same = sum(common.values()) if num_same == 0: return 0 precision = 1.0 * num_same / len(prediction_tokens) recall = 1.0 * num_same / len(ground_truth_tokens) f1 = (2 * precision * recall) / (precision + recall) return f1 def exact_match_score(prediction, ground_truth): return (normalize_answer(prediction) == normalize_answer(ground_truth)) def metric_max_over_ground_truths(metric_fn, prediction, ground_truths): scores_for_ground_truths = [] for ground_truth in ground_truths: score = metric_fn(prediction, ground_truth) scores_for_ground_truths.append(score) return max(scores_for_ground_truths) def evaluate(dataset, predictions): f1 = exact_match = total = 0 for article in dataset: for paragraph in article['paragraphs']: for qa in paragraph['qas']: total += 1 if qa['id'] not in predictions: 
message = 'Unanswered question ' + qa['id'] + \ ' will receive score 0.' print(message, file=sys.stderr) continue ground_truths = list(map(lambda x: x['text'], qa['answers'])) prediction = predictions[qa['id']] exact_match += metric_max_over_ground_truths( exact_match_score, prediction, ground_truths) f1 += metric_max_over_ground_truths( f1_score, prediction, ground_truths) exact_match = 100.0 * exact_match / total f1 = 100.0 * f1 / total return {'exact_match': exact_match, 'f1': f1} if __name__ == '__main__': expected_version = '1.1' parser = argparse.ArgumentParser( description='Evaluation for SQuAD ' + expected_version) parser.add_argument('dataset_file', help='Dataset file') parser.add_argument('prediction_file', help='Prediction File') args = parser.parse_args() with open(args.dataset_file) as dataset_file: dataset_json = json.load(dataset_file) if (dataset_json['version'] != expected_version): print('Evaluation expects v-' + expected_version + ', but got dataset with v-' + dataset_json['version'], file=sys.stderr) dataset = dataset_json['data'] with open(args.prediction_file) as prediction_file: predictions = json.load(prediction_file) print(json.dumps(evaluate(dataset, predictions))) ================================================ FILE: script/interactive.py ================================================ #!/usr/bin/env python3 # Copyright 2018-present, HKUST-KnowComp. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. 
"""A script to run the reader model interactively.""" import sys sys.path.append('.') import torch import code import argparse import logging import prettytable import time from predictor import Predictor from multiprocessing import cpu_count logger = logging.getLogger() logger.setLevel(logging.INFO) fmt = logging.Formatter('%(asctime)s: [ %(message)s ]', '%m/%d/%Y %I:%M:%S %p') console = logging.StreamHandler() console.setFormatter(fmt) logger.addHandler(console) PREDICTOR = None # ------------------------------------------------------------------------------ # Drop in to interactive mode # ------------------------------------------------------------------------------ def process(document, question, candidates=None, top_n=1): t0 = time.time() predictions = PREDICTOR.predict(document, question, candidates, top_n) table = prettytable.PrettyTable(['Rank', 'Span', 'Score']) for i, p in enumerate(predictions, 1): table.add_row([i, p[0], p[1]]) print(table) print('Time: %.4f' % (time.time() - t0)) banner = """ * WRMCQA interactive Document Reader Module * * Repo: Mnemonic Reader (https://github.com/HKUST-KnowComp/MnemonicReader) * Implement based on Facebook's DrQA >>> process(document, question, candidates=None, top_n=1) >>> usage() """ def usage(): print(banner) # ------------------------------------------------------------------------------ # Commandline arguments & init # ------------------------------------------------------------------------------ if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--model', type=str, default=None, help='Path to model to use') parser.add_argument('--embedding-file', type=str, default=None, help=('Expand dictionary to use all pretrained ' 'embeddings in this file.')) parser.add_argument('--char-embedding-file', type=str, default=None, help=('Expand dictionary to use all pretrained ' 'char embeddings in this file.')) parser.add_argument('--num-workers', type=int, default=int(cpu_count()/2), 
help='Number of CPU processes (for tokenizing, etc)') parser.add_argument('--no-cuda', action='store_true', help='Use CPU only') parser.add_argument('--gpu', type=int, default=-1, help='Specify GPU device id to use') parser.add_argument('--no-normalize', action='store_true', help='Do not softmax normalize output scores.') args = parser.parse_args() args.cuda = not args.no_cuda and torch.cuda.is_available() if args.cuda: torch.cuda.set_device(args.gpu) logger.info('CUDA enabled (GPU %d)' % args.gpu) else: logger.info('Running on CPU only.') PREDICTOR = Predictor( args.model, normalize=not args.no_normalize, embedding_file=args.embedding_file, char_embedding_file=args.char_embedding_file, num_workers=args.num_workers, ) if args.cuda: PREDICTOR.cuda() code.interact(banner=banner, local=locals()) ================================================ FILE: script/predict.py ================================================ #!/usr/bin/env python3 # Copyright 2017-present, Facebook, Inc. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. 
"""A script to make and save model predictions on an input dataset.""" import sys sys.path.append('.') import os import time import torch import argparse import logging try: import ujson as json except ImportError: import json from tqdm import tqdm from predictor import Predictor from multiprocessing import cpu_count logger = logging.getLogger() logger.setLevel(logging.INFO) fmt = logging.Formatter('%(asctime)s: [ %(message)s ]', '%m/%d/%Y %I:%M:%S %p') console = logging.StreamHandler() console.setFormatter(fmt) logger.addHandler(console) parser = argparse.ArgumentParser() parser.add_argument('dataset', type=str, default=None, help='SQuAD-like dataset to evaluate on') parser.add_argument('--model', type=str, default=None, help='Path to model to use') parser.add_argument('--embedding-file', type=str, default=None, help=('Expand dictionary to use all pretrained ' 'embeddings in this file.')) parser.add_argument('--char-embedding-file', type=str, default=None, help=('Expand dictionary to use all pretrained ' 'char embeddings in this file.')) parser.add_argument('--out-dir', type=str, default='data/predict', help=('Directory to write prediction file to ' '(-.preds)')) parser.add_argument('--num-workers', type=int, default=int(cpu_count()/2), help='Number of CPU processes (for tokenizing, etc)') parser.add_argument('--no-cuda', action='store_true', help='Use CPU only') parser.add_argument('--gpu', type=int, default=-1, help='Specify GPU device id to use') parser.add_argument('--batch-size', type=int, default=128, help='Example batching size') parser.add_argument('--top-n', type=int, default=1, help='Store top N predicted spans per example') parser.add_argument('--official', type=bool, default=True, help='Only store single top span instead of top N list') args = parser.parse_args() t0 = time.time() args.cuda = not args.no_cuda and torch.cuda.is_available() if args.cuda: torch.cuda.set_device(args.gpu) logger.info('CUDA enabled (GPU %d)' % args.gpu) else: 
logger.info('Running on CPU only.') predictor = Predictor( args.model, normalize=True, embedding_file=args.embedding_file, char_embedding_file=args.char_embedding_file, num_workers=args.num_workers, ) if args.cuda: predictor.cuda() # ------------------------------------------------------------------------------ # Read in dataset and make predictions. # ------------------------------------------------------------------------------ examples = [] qids = [] with open(args.dataset) as f: data = json.load(f)['data'] for article in data: for paragraph in article['paragraphs']: context = paragraph['context'] for qa in paragraph['qas']: qids.append(qa['id']) examples.append((context, qa['question'])) results = {} for i in tqdm(range(0, len(examples), args.batch_size)): predictions = predictor.predict_batch( examples[i:i + args.batch_size], top_n=args.top_n ) for j in range(len(predictions)): # Official eval expects just a qid --> span if args.official: results[qids[i + j]] = predictions[j][0][0] # Otherwise we store top N and scores for debugging. else: results[qids[i + j]] = [(p[0], float(p[1])) for p in predictions[j]] model = os.path.splitext(os.path.basename(args.model or 'default'))[0] basename = os.path.splitext(os.path.basename(args.dataset))[0] outfile = os.path.join(args.out_dir, basename + '-' + model + '.preds') if not os.path.isdir(args.out_dir): os.mkdir(args.out_dir) logger.info('Writing results to %s' % outfile) with open(outfile, 'w') as f: json.dump(results, f) logger.info('Total time: %.2f' % (time.time() - t0)) ================================================ FILE: script/preprocess.py ================================================ #!/usr/bin/env python3 # Copyright 2017-present, Facebook, Inc. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. 
"""Preprocess the SQuAD dataset for training.""" import sys sys.path.append('.') import argparse import os try: import ujson as json except ImportError: import json import time from multiprocessing import Pool, cpu_count from multiprocessing.util import Finalize from functools import partial from spacy_tokenizer import SpacyTokenizer # ------------------------------------------------------------------------------ # Tokenize + annotate. # ------------------------------------------------------------------------------ TOK = None ANNTOTORS = {'lemma', 'pos', 'ner'} def init(): global TOK TOK = SpacyTokenizer(annotators=ANNTOTORS) Finalize(TOK, TOK.shutdown, exitpriority=100) def tokenize(text): """Call the global process tokenizer on the input text.""" global TOK tokens = TOK.tokenize(text) output = { 'words': tokens.words(), 'chars': tokens.chars(), 'offsets': tokens.offsets(), 'pos': tokens.pos(), 'lemma': tokens.lemmas(), 'ner': tokens.entities(), } return output # ------------------------------------------------------------------------------ # Process dataset examples # ------------------------------------------------------------------------------ def load_dataset(path): """Load json file and store fields separately.""" with open(path) as f: data = json.load(f)['data'] output = {'qids': [], 'questions': [], 'answers': [], 'contexts': [], 'qid2cid': []} for article in data: for paragraph in article['paragraphs']: output['contexts'].append(paragraph['context']) for qa in paragraph['qas']: output['qids'].append(qa['id']) output['questions'].append(qa['question']) output['qid2cid'].append(len(output['contexts']) - 1) if 'answers' in qa: output['answers'].append(qa['answers']) return output def find_answer(offsets, begin_offset, end_offset): """Match token offsets with the char begin/end offsets of the answer.""" start = [i for i, tok in enumerate(offsets) if tok[0] == begin_offset] end = [i for i, tok in enumerate(offsets) if tok[1] == end_offset] assert(len(start) <= 
1) assert(len(end) <= 1) if len(start) == 1 and len(end) == 1: return start[0], end[0] def process_dataset(data, tokenizer, workers=None): """Iterate processing (tokenize, parse, etc) dataset multithreaded.""" make_pool = partial(Pool, workers, initializer=init) workers = make_pool(initargs=()) q_tokens = workers.map(tokenize, data['questions']) workers.close() workers.join() workers = make_pool(initargs=()) c_tokens = workers.map(tokenize, data['contexts']) workers.close() workers.join() for idx in range(len(data['qids'])): question = q_tokens[idx]['words'] question_char = q_tokens[idx]['chars'] qlemma = q_tokens[idx]['lemma'] qpos = q_tokens[idx]['pos'] qner = q_tokens[idx]['ner'] document = c_tokens[data['qid2cid'][idx]]['words'] document_char = c_tokens[data['qid2cid'][idx]]['chars'] offsets = c_tokens[data['qid2cid'][idx]]['offsets'] clemma = c_tokens[data['qid2cid'][idx]]['lemma'] cpos = c_tokens[data['qid2cid'][idx]]['pos'] cner = c_tokens[data['qid2cid'][idx]]['ner'] ans_tokens = [] if len(data['answers']) > 0: for ans in data['answers'][idx]: found = find_answer(offsets, ans['answer_start'], ans['answer_start'] + len(ans['text'])) if found: ans_tokens.append(found) yield { 'id': data['qids'][idx], 'question': question, 'question_char': question_char, 'document': document, 'document_char': document_char, 'offsets': offsets, 'answers': ans_tokens, 'qlemma': qlemma, 'qpos': qpos, 'qner': qner, 'clemma': clemma, 'cpos': cpos, 'cner': cner, } # ----------------------------------------------------------------------------- # Commandline options # ----------------------------------------------------------------------------- parser = argparse.ArgumentParser() parser.add_argument('data_dir', type=str, help='Path to SQuAD data directory') parser.add_argument('out_dir', type=str, help='Path to output file dir') parser.add_argument('--split', type=str, help='Filename for train/dev split') parser.add_argument('--num-workers', type=int, default=1) 
parser.add_argument('--tokenizer', type=str, default='spacy') args = parser.parse_args() t0 = time.time() in_file = os.path.join(args.data_dir, args.split + '.json') print('Loading dataset %s' % in_file, file=sys.stderr) dataset = load_dataset(in_file) out_file = os.path.join( args.out_dir, '%s-processed-%s.txt' % (args.split, args.tokenizer) ) print('Will write to file %s' % out_file, file=sys.stderr) with open(out_file, 'w') as f: for ex in process_dataset(dataset, args.tokenizer, args.num_workers): f.write(json.dumps(ex) + '\n') print('Total time: %.4f (s)' % (time.time() - t0)) ================================================ FILE: script/train.py ================================================ #!/usr/bin/env python3 # Copyright 2018-present, HKUST-KnowComp. # All rights reserved. # # This source code is licensed under the license found in the # LICENSE file in the root directory of this source tree. """Main reader training script.""" import sys sys.path.append('.') import argparse import torch import numpy as np try: import ujson as json except ImportError: import json import os import subprocess import logging import utils, vector, config, data from model import DocReader logger = logging.getLogger() # ------------------------------------------------------------------------------ # Training arguments. # ------------------------------------------------------------------------------ # Defaults DATA_DIR = os.path.join('data', 'datasets') MODEL_DIR = os.path.join('data', 'models') EMBED_DIR = os.path.join('data', 'embeddings') def str2bool(v): return v.lower() in ('yes', 'true', 't', '1', 'y') def add_train_args(parser): """Adds commandline arguments pertaining to training a model. These are different from the arguments dictating the model architecture. 
""" parser.register('type', 'bool', str2bool) # Runtime environment runtime = parser.add_argument_group('Environment') runtime.add_argument('--no-cuda', type='bool', default=False, help='Train on CPU, even if GPUs are available.') runtime.add_argument('--gpu', type=int, default=-1, help='Run on a specific GPU') runtime.add_argument('--data-workers', type=int, default=5, help='Number of subprocesses for data loading') runtime.add_argument('--parallel', type='bool', default=False, help='Use DataParallel on all available GPUs') runtime.add_argument('--random-seed', type=int, default=1013, help=('Random seed for all numpy/torch/cuda ' 'operations (for reproducibility)')) runtime.add_argument('--num-epochs', type=int, default=40, help='Train data iterations') runtime.add_argument('--batch-size', type=int, default=45, help='Batch size for training') runtime.add_argument('--test-batch-size', type=int, default=32, help='Batch size during validation/testing') # Files files = parser.add_argument_group('Filesystem') files.add_argument('--model-dir', type=str, default=MODEL_DIR, help='Directory for saved models/checkpoints/logs') files.add_argument('--model-name', type=str, default='', help='Unique model identifier (.mdl, .txt, .checkpoint)') files.add_argument('--data-dir', type=str, default=DATA_DIR, help='Directory of training/validation data') files.add_argument('--train-file', type=str, default='SQuAD-v1.1-train-processed-spacy.txt', help='Preprocessed train file') files.add_argument('--dev-file', type=str, default='SQuAD-v1.1-dev-processed-spacy.txt', help='Preprocessed dev file') files.add_argument('--dev-json', type=str, default='SQuAD-v1.1-dev.json', help=('Unprocessed dev file to run validation ' 'while training on')) files.add_argument('--embed-dir', type=str, default=EMBED_DIR, help='Directory of pre-trained embedding files') files.add_argument('--embedding-file', type=str, default='glove.840B.300d.txt', help='Space-separated pretrained embeddings file') 
files.add_argument('--char-embedding-file', type=str, default='glove.840B.300d-char.txt', help='Space-separated pretrained char embeddings file') # Saving + loading save_load = parser.add_argument_group('Saving/Loading') save_load.add_argument('--checkpoint', type='bool', default=False, help='Save model + optimizer state after each epoch') save_load.add_argument('--pretrained', type=str, default='', help='Path to a pretrained model to warm-start with') save_load.add_argument('--expand-dictionary', type='bool', default=False, help='Expand dictionary of pretrained model to ' + 'include training/dev words of new data') # Data preprocessing preprocess = parser.add_argument_group('Preprocessing') preprocess.add_argument('--uncased-question', type='bool', default=False, help='Question words will be lower-cased') preprocess.add_argument('--uncased-doc', type='bool', default=False, help='Document words will be lower-cased') preprocess.add_argument('--restrict-vocab', type='bool', default=True, help='Only use pre-trained words in embedding_file') # General general = parser.add_argument_group('General') general.add_argument('--official-eval', type='bool', default=True, help='Validate with official SQuAD eval') general.add_argument('--valid-metric', type=str, default='exact_match', help='The evaluation metric used for model selection: None, exact_match, f1') general.add_argument('--display-iter', type=int, default=25, help='Log state every display_iter training iterations') general.add_argument('--sort-by-len', type='bool', default=True, help='Sort batches by length for speed') def set_defaults(args): """Make sure the commandline arguments are initialized properly.""" # Check critical files exist args.dev_json = os.path.join(args.data_dir, args.dev_json) if not os.path.isfile(args.dev_json): raise IOError('No such file: %s' % args.dev_json) args.train_file = os.path.join(args.data_dir, args.train_file) if not os.path.isfile(args.train_file): raise IOError('No such file: %s' % args.train_file) 
args.dev_file = os.path.join(args.data_dir, args.dev_file) if not os.path.isfile(args.dev_file): raise IOError('No such file: %s' % args.dev_file) if args.embedding_file: args.embedding_file = os.path.join(args.embed_dir, args.embedding_file) if not os.path.isfile(args.embedding_file): raise IOError('No such file: %s' % args.embedding_file) if args.char_embedding_file: args.char_embedding_file = os.path.join(args.embed_dir, args.char_embedding_file) if not os.path.isfile(args.char_embedding_file): raise IOError('No such file: %s' % args.char_embedding_file) # Set model directory subprocess.call(['mkdir', '-p', args.model_dir]) # Set model name if not args.model_name: import uuid import time args.model_name = time.strftime("%Y%m%d-") + str(uuid.uuid4())[:8] # Set log + model file names args.log_file = os.path.join(args.model_dir, args.model_name + '.txt') args.model_file = os.path.join(args.model_dir, args.model_name + '.mdl') # Embeddings options if args.embedding_file: with open(args.embedding_file) as f: dim = len(f.readline().strip().split(' ')) - 1 args.embedding_dim = dim elif not args.embedding_dim: raise RuntimeError('Either embedding_file or embedding_dim ' 'needs to be specified.') if args.char_embedding_file: with open(args.char_embedding_file) as f: dim = len(f.readline().strip().split(' ')) - 1 args.char_embedding_dim = dim elif not args.char_embedding_dim: raise RuntimeError('Either char_embedding_file or char_embedding_dim ' 'needs to be specified.') # Make sure tune_partial and fix_embeddings are consistent. 
if args.tune_partial > 0 and args.fix_embeddings: logger.warning('WARN: fix_embeddings set to False as tune_partial > 0.') args.fix_embeddings = False # Make sure fix_embeddings and embedding_file are consistent if args.fix_embeddings: if not (args.embedding_file or args.pretrained): logger.warning('WARN: fix_embeddings set to False ' 'as embeddings are random.') args.fix_embeddings = False return args # ------------------------------------------------------------------------------ # Initialization from scratch. # ------------------------------------------------------------------------------ def init_from_scratch(args, train_exs, dev_exs): """New model, new data, new dictionary.""" # Create a feature dict out of the annotations in the data logger.info('-' * 100) logger.info('Generate features') feature_dict = utils.build_feature_dict(args, train_exs) logger.info('Num features = %d' % len(feature_dict)) logger.info(feature_dict) # Build a dictionary from the data questions + documents (train/dev splits) logger.info('-' * 100) logger.info('Build word dictionary') word_dict = utils.build_word_dict(args, train_exs + dev_exs) logger.info('Num words = %d' % len(word_dict)) # Build a char dictionary from the data questions + documents (train/dev splits) logger.info('-' * 100) logger.info('Build char dictionary') char_dict = utils.build_char_dict(args, train_exs + dev_exs) logger.info('Num chars = %d' % len(char_dict)) # Initialize model model = DocReader(config.get_model_args(args), word_dict, char_dict, feature_dict) # Load pretrained embeddings for words in dictionary if args.embedding_file: model.load_embeddings(word_dict.tokens(), args.embedding_file) if args.char_embedding_file: model.load_char_embeddings(char_dict.tokens(), args.char_embedding_file) return model # ------------------------------------------------------------------------------ # Train loop. 
# ------------------------------------------------------------------------------ def train(args, data_loader, model, global_stats): """Run through one epoch of model training with the provided data loader.""" # Initialize meters + timers train_loss = utils.AverageMeter() epoch_time = utils.Timer() # Run one epoch for idx, ex in enumerate(data_loader): train_loss.update(*model.update(ex)) if idx % args.display_iter == 0: logger.info('train: Epoch = %d | iter = %d/%d | ' % (global_stats['epoch'], idx, len(data_loader)) + 'loss = %.2f | elapsed time = %.2f (s)' % (train_loss.avg, global_stats['timer'].time())) train_loss.reset() logger.info('train: Epoch %d done. Time for epoch = %.2f (s)' % (global_stats['epoch'], epoch_time.time())) # Checkpoint if args.checkpoint: model.checkpoint(args.model_file + '.checkpoint', global_stats['epoch'] + 1) # ------------------------------------------------------------------------------ # Validation loops. Includes both "unofficial" and "official" functions that # use different metrics and implementations. # ------------------------------------------------------------------------------ def validate_unofficial(args, data_loader, model, global_stats, mode): """Run one full unofficial validation. Unofficial = doesn't use SQuAD script. 
""" eval_time = utils.Timer() start_acc = utils.AverageMeter() end_acc = utils.AverageMeter() exact_match = utils.AverageMeter() # Make predictions examples = 0 for ex in data_loader: batch_size = ex[0].size(0) pred_s, pred_e, _ = model.predict(ex) target_s, target_e = ex[-3:-1] # We get metrics for independent start/end and joint start/end accuracies = eval_accuracies(pred_s, target_s, pred_e, target_e) start_acc.update(accuracies[0], batch_size) end_acc.update(accuracies[1], batch_size) exact_match.update(accuracies[2], batch_size) # If getting train accuracies, sample max 10k examples += batch_size if mode == 'train' and examples >= 1e4: break logger.info('%s valid unofficial: Epoch = %d | start = %.2f | ' % (mode, global_stats['epoch'], start_acc.avg) + 'end = %.2f | exact = %.2f | examples = %d | ' % (end_acc.avg, exact_match.avg, examples) + 'valid time = %.2f (s)' % eval_time.time()) return {'exact_match': exact_match.avg} def validate_official(args, data_loader, model, global_stats, offsets, texts, answers): """Run one full official validation. Uses exact spans and same exact match/F1 score computation as in the SQuAD script. Extra arguments: offsets: The character start/end indices for the tokens in each context. texts: Map of qid --> raw text of examples context (matches offsets). answers: Map of qid --> list of accepted answers. 
""" eval_time = utils.Timer() f1 = utils.AverageMeter() exact_match = utils.AverageMeter() # Run through examples examples = 0 for ex in data_loader: ex_id, batch_size = ex[-1], ex[0].size(0) pred_s, pred_e, _ = model.predict(ex) for i in range(batch_size): s_offset = offsets[ex_id[i]][pred_s[i][0]][0] e_offset = offsets[ex_id[i]][pred_e[i][0]][1] prediction = texts[ex_id[i]][s_offset:e_offset] # Compute metrics ground_truths = answers[ex_id[i]] exact_match.update(utils.metric_max_over_ground_truths( utils.exact_match_score, prediction, ground_truths)) f1.update(utils.metric_max_over_ground_truths( utils.f1_score, prediction, ground_truths)) examples += batch_size logger.info('dev valid official: Epoch = %d | EM = %.2f | ' % (global_stats['epoch'], exact_match.avg * 100) + 'F1 = %.2f | examples = %d | valid time = %.2f (s)' % (f1.avg * 100, examples, eval_time.time())) return {'exact_match': exact_match.avg * 100, 'f1': f1.avg * 100} def eval_accuracies(pred_s, target_s, pred_e, target_e): """An unofficial evalutation helper. Compute exact start/end/complete match accuracies for a batch. """ # Convert 1D tensors to lists of lists (compatibility) if torch.is_tensor(target_s): target_s = [[e] for e in target_s] target_e = [[e] for e in target_e] # Compute accuracies from targets batch_size = len(pred_s) start = utils.AverageMeter() end = utils.AverageMeter() em = utils.AverageMeter() for i in range(batch_size): # Start matches if pred_s[i] in target_s[i]: start.update(1) else: start.update(0) # End matches if pred_e[i] in target_e[i]: end.update(1) else: end.update(0) # Both start and end match if any([1 for _s, _e in zip(target_s[i], target_e[i]) if _s == torch.from_numpy(pred_s[i]) and _e == torch.from_numpy(pred_e[i])]): em.update(1) else: em.update(0) return start.avg * 100, end.avg * 100, em.avg * 100 # ------------------------------------------------------------------------------ # Main. 
# ------------------------------------------------------------------------------


def main(args):
    # --------------------------------------------------------------------------
    # DATA
    logger.info('-' * 100)
    logger.info('Load data files')
    train_exs = utils.load_data(args, args.train_file, skip_no_answer=True)
    logger.info('Num train examples = %d' % len(train_exs))
    dev_exs = utils.load_data(args, args.dev_file)
    logger.info('Num dev examples = %d' % len(dev_exs))

    # If we are doing official evals then we need to:
    # 1) Load the original text to retrieve spans from offsets.
    # 2) Load the (multiple) text answers for each question.
    if args.official_eval:
        dev_texts = utils.load_text(args.dev_json)
        dev_offsets = {ex['id']: ex['offsets'] for ex in dev_exs}
        dev_answers = utils.load_answers(args.dev_json)

    # --------------------------------------------------------------------------
    # MODEL
    logger.info('-' * 100)
    start_epoch = 0
    if args.checkpoint and os.path.isfile(args.model_file + '.checkpoint'):
        # Just resume training, no modifications.
        logger.info('Found a checkpoint...')
        checkpoint_file = args.model_file + '.checkpoint'
        model, start_epoch = DocReader.load_checkpoint(checkpoint_file, args)
    else:
        # Training starts fresh. But the model state is either pretrained or
        # newly (randomly) initialized.
        if args.pretrained:
            logger.info('Using pretrained model...')
            model = DocReader.load(args.pretrained, args)
            if args.expand_dictionary:
                logger.info('Expanding dictionary for new data...')
                # Add words in training + dev examples
                words = utils.load_words(args, train_exs + dev_exs)
                added_words = model.expand_dictionary(words)
                # Load pretrained embeddings for added words
                if args.embedding_file:
                    model.load_embeddings(added_words, args.embedding_file)

                logger.info('Expanding char dictionary for new data...')
                # Add chars in training + dev examples
                chars = utils.load_chars(args, train_exs + dev_exs)
                added_chars = model.expand_char_dictionary(chars)
                # Load pretrained embeddings for added chars
                if args.char_embedding_file:
                    model.load_char_embeddings(added_chars,
                                               args.char_embedding_file)

        else:
            logger.info('Training model from scratch...')
            model = init_from_scratch(args, train_exs, dev_exs)

        # Set up partial tuning of embeddings
        if args.tune_partial > 0:
            logger.info('-' * 100)
            logger.info('Counting %d most frequent question words' %
                        args.tune_partial)
            top_words = utils.top_question_words(
                args, train_exs, model.word_dict
            )
            for word in top_words[:5]:
                logger.info(word)
            logger.info('...')
            for word in top_words[-6:-1]:
                logger.info(word)
            model.tune_embeddings([w[0] for w in top_words])

        # Set up optimizer
        model.init_optimizer()

    # Use the GPU?
    if args.cuda:
        model.cuda()

    # Use multiple GPUs?
    if args.parallel:
        model.parallelize()

    # --------------------------------------------------------------------------
    # DATA ITERATORS
    # Two datasets: train and dev. If we sort by length it's faster.
    logger.info('-' * 100)
    logger.info('Make data loaders')
    train_dataset = data.ReaderDataset(train_exs, model, single_answer=True)
    if args.sort_by_len:
        train_sampler = data.SortedBatchSampler(train_dataset.lengths(),
                                                args.batch_size,
                                                shuffle=True)
    else:
        train_sampler = torch.utils.data.sampler.RandomSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        sampler=train_sampler,
        num_workers=args.data_workers,
        collate_fn=vector.batchify,
        pin_memory=args.cuda,
    )
    dev_dataset = data.ReaderDataset(dev_exs, model, single_answer=False)
    if args.sort_by_len:
        dev_sampler = data.SortedBatchSampler(dev_dataset.lengths(),
                                              args.test_batch_size,
                                              shuffle=False)
    else:
        dev_sampler = torch.utils.data.sampler.SequentialSampler(dev_dataset)
    dev_loader = torch.utils.data.DataLoader(
        dev_dataset,
        batch_size=args.test_batch_size,
        sampler=dev_sampler,
        num_workers=args.data_workers,
        collate_fn=vector.batchify,
        pin_memory=args.cuda,
    )

    # -------------------------------------------------------------------------
    # PRINT CONFIG
    logger.info('-' * 100)
    logger.info('CONFIG:\n%s' %
                json.dumps(vars(args), indent=4, sort_keys=True))

    # --------------------------------------------------------------------------
    # TRAIN/VALID LOOP
    logger.info('-' * 100)
    logger.info('Starting training...')
    stats = {'timer': utils.Timer(), 'epoch': 0, 'best_valid': 0}
    for epoch in range(start_epoch, args.num_epochs):
        stats['epoch'] = epoch

        # Train
        train(args, train_loader, model, stats)

        # Validate unofficial (train)
        validate_unofficial(args, train_loader, model, stats, mode='train')

        # Validate unofficial (dev)
        result = validate_unofficial(args, dev_loader, model, stats, mode='dev')

        # Validate official
        if args.official_eval:
            result = validate_official(args, dev_loader, model, stats,
                                       dev_offsets, dev_texts, dev_answers)

        # Save best valid
        if args.valid_metric is None or args.valid_metric == 'None':
            model.save(args.model_file)
        elif result[args.valid_metric] > stats['best_valid']:
            logger.info('Best valid: %s = %.2f (epoch %d, %d updates)' %
                        (args.valid_metric, result[args.valid_metric],
                         stats['epoch'], model.updates))
            model.save(args.model_file)
            stats['best_valid'] = result[args.valid_metric]


if __name__ == '__main__':
    # Parse cmdline args and setup environment
    parser = argparse.ArgumentParser(
        'WRMCQA Document Reader',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    add_train_args(parser)
    config.add_model_args(parser)
    args = parser.parse_args()
    set_defaults(args)

    # Set cuda
    args.cuda = not args.no_cuda and torch.cuda.is_available()
    if args.cuda:
        torch.cuda.set_device(args.gpu)

    # Set random state
    np.random.seed(args.random_seed)
    torch.manual_seed(args.random_seed)
    if args.cuda:
        torch.cuda.manual_seed(args.random_seed)

    # Set logging
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter('%(asctime)s: [ %(message)s ]',
                            '%m/%d/%Y %I:%M:%S %p')
    console = logging.StreamHandler()
    console.setFormatter(fmt)
    logger.addHandler(console)
    if args.log_file:
        # Append to the log when resuming from a checkpoint, else overwrite
        if args.checkpoint:
            logfile = logging.FileHandler(args.log_file, 'a')
        else:
            logfile = logging.FileHandler(args.log_file, 'w')
        logfile.setFormatter(fmt)
        logger.addHandler(logfile)
    logger.info('COMMAND: %s' % ' '.join(sys.argv))
    print(args)

    # Run!
    main(args)

================================================
FILE: spacy_tokenizer.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Tokenizer that is backed by spaCy (spacy.io).

Requires spaCy package and the spaCy english model.
"""

import spacy
import copy


class Tokens(object):
    """A class to represent a list of tokenized text."""
    TEXT = 0
    CHAR = 1
    TEXT_WS = 2
    SPAN = 3
    POS = 4
    LEMMA = 5
    NER = 6

    def __init__(self, data, annotators, opts=None):
        self.data = data
        self.annotators = annotators
        self.opts = opts or {}

    def __len__(self):
        """The number of tokens."""
        return len(self.data)

    def slice(self, i=None, j=None):
        """Return a view of the list of tokens from [i, j)."""
        new_tokens = copy.copy(self)
        new_tokens.data = self.data[i: j]
        return new_tokens

    def untokenize(self):
        """Returns the original text (with whitespace reinserted)."""
        return ''.join([t[self.TEXT_WS] for t in self.data]).strip()

    def chars(self, uncased=False):
        """Returns a list of the characters of each token (one list per token)

        Args:
            uncased: lower cases characters
        """
        if uncased:
            return [[c.lower() for c in t[self.CHAR]] for t in self.data]
        else:
            return [[c for c in t[self.CHAR]] for t in self.data]

    def words(self, uncased=False):
        """Returns a list of the text of each token

        Args:
            uncased: lower cases text
        """
        if uncased:
            return [t[self.TEXT].lower() for t in self.data]
        else:
            return [t[self.TEXT] for t in self.data]

    def offsets(self):
        """Returns a list of [start, end) character offsets of each token."""
        return [t[self.SPAN] for t in self.data]

    def pos(self):
        """Returns a list of part-of-speech tags of each token.
        Returns None if this annotation was not included.
        """
        if 'pos' not in self.annotators:
            return None
        return [t[self.POS] for t in self.data]

    def lemmas(self):
        """Returns a list of the lemmatized text of each token.
        Returns None if this annotation was not included.
        """
        if 'lemma' not in self.annotators:
            return None
        return [t[self.LEMMA] for t in self.data]

    def entities(self):
        """Returns a list of named-entity-recognition tags of each token.
        Returns None if this annotation was not included.
        """
        if 'ner' not in self.annotators:
            return None
        return [t[self.NER] for t in self.data]

    def ngrams(self, n=1, uncased=False, filter_fn=None, as_strings=True):
        """Returns a list of all ngrams from length 1 to n.

        Args:
            n: upper limit of ngram length
            uncased: lower cases text
            filter_fn: user function that takes in an ngram list and returns
              True to skip the ngram or False to keep it
            as_strings: return the ngram as a string vs list
        """
        def _skip(gram):
            if not filter_fn:
                return False
            return filter_fn(gram)

        words = self.words(uncased)
        ngrams = [(s, e + 1)
                  for s in range(len(words))
                  for e in range(s, min(s + n, len(words)))
                  if not _skip(words[s:e + 1])]

        # Concatenate into strings
        if as_strings:
            ngrams = ['{}'.format(' '.join(words[s:e])) for (s, e) in ngrams]

        return ngrams

    def entity_groups(self):
        """Group consecutive entity tokens with the same NER tag."""
        entities = self.entities()
        if not entities:
            return None
        non_ent = self.opts.get('non_ent', 'O')
        groups = []
        idx = 0
        while idx < len(entities):
            ner_tag = entities[idx]
            # Check for entity tag
            if ner_tag != non_ent:
                # Chomp the sequence
                start = idx
                while (idx < len(entities) and entities[idx] == ner_tag):
                    idx += 1
                groups.append((self.slice(start, idx).untokenize(), ner_tag))
            else:
                idx += 1
        return groups


class SpacyTokenizer(object):

    def __init__(self, **kwargs):
        """
        Args:
            annotators: set that can include pos, lemma, and ner.
            model: spaCy model to use (either path, or keyword like 'en').
        """
        model = kwargs.get('model', 'en')
        self.annotators = copy.deepcopy(kwargs.get('annotators', set()))
        self.nlp = spacy.load(model)
        self.nlp.remove_pipe('parser')
        if not any([p in self.annotators for p in ['lemma', 'pos', 'ner']]):
            self.nlp.remove_pipe('tagger')
        if 'ner' not in self.annotators:
            self.nlp.remove_pipe('ner')

    def tokenize(self, text):
        # We don't treat new lines as tokens.
        clean_text = text.replace('\n', ' ')
        tokens = self.nlp(clean_text)
        data = []
        for i in range(len(tokens)):
            # Get whitespace
            start_ws = tokens[i].idx
            if i + 1 < len(tokens):
                end_ws = tokens[i + 1].idx
            else:
                end_ws = tokens[i].idx + len(tokens[i].text)

            data.append((
                tokens[i].text,
                list(tokens[i].text),
                text[start_ws: end_ws],
                (tokens[i].idx, tokens[i].idx + len(tokens[i].text)),
                tokens[i].tag_,
                tokens[i].lemma_,
                tokens[i].ent_type_,
            ))

        # Set special option for non-entity tag: '' vs 'O' in spaCy
        return Tokens(data, self.annotators, opts={'non_ent': ''})

    def shutdown(self):
        pass

    def __del__(self):
        self.shutdown()

================================================
FILE: utils.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Reader utilities."""

try:
    import ujson as json
except ImportError:
    import json
import time
import logging
import string
try:
    import regex as re
except ImportError:
    import re

from collections import Counter
from data import Dictionary

logger = logging.getLogger(__name__)


# ------------------------------------------------------------------------------
# Data loading
# ------------------------------------------------------------------------------


def load_data(args, filename, skip_no_answer=False):
    """Load examples from preprocessed file.
    One example per line, JSON encoded.
    """
    # Load JSON lines
    with open(filename) as f:
        examples = [json.loads(line) for line in f]

    # Make case insensitive?
    if args.uncased_question or args.uncased_doc:
        for ex in examples:
            if args.uncased_question:
                ex['question'] = [w.lower() for w in ex['question']]
                ex['question_char'] = [w.lower() for w in ex['question_char']]
            if args.uncased_doc:
                ex['document'] = [w.lower() for w in ex['document']]
                ex['document_char'] = [w.lower() for w in ex['document_char']]

    # Skip unparsed (start/end) examples
    if skip_no_answer:
        examples = [ex for ex in examples if len(ex['answers']) > 0]

    return examples


def load_text(filename):
    """Load the paragraphs only of a SQuAD dataset. Store as qid -> text."""
    # Load JSON file
    with open(filename) as f:
        examples = json.load(f)['data']

    texts = {}
    for article in examples:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                texts[qa['id']] = paragraph['context']
    return texts


def load_answers(filename):
    """Load the answers only of a SQuAD dataset. Store as qid -> [answers]."""
    # Load JSON file
    with open(filename) as f:
        examples = json.load(f)['data']

    ans = {}
    for article in examples:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                ans[qa['id']] = list(map(lambda x: x['text'], qa['answers']))
    return ans


# ------------------------------------------------------------------------------
# Dictionary building
# ------------------------------------------------------------------------------


def index_embedding_words(embedding_file):
    """Put all the words in embedding_file into a set."""
    words = set()
    with open(embedding_file) as f:
        for line in f:
            w = Dictionary.normalize(line.rstrip().split(' ')[0])
            words.add(w)
    return words


def load_words(args, examples):
    """Iterate and index all the words in examples (documents + questions)."""
    def _insert(iterable):
        for w in iterable:
            w = Dictionary.normalize(w)
            if valid_words and w not in valid_words:
                continue
            words.add(w)

    if args.restrict_vocab and args.embedding_file:
        logger.info('Restricting to words in %s' % args.embedding_file)
        valid_words = index_embedding_words(args.embedding_file)
        logger.info('Num words in set = %d' % len(valid_words))
    else:
        valid_words = None

    words = set()
    for ex in examples:
        _insert(ex['question'])
        _insert(ex['document'])
    return words


def build_word_dict(args, examples):
    """Return a word dictionary from question and document words in
    provided examples.
    """
    word_dict = Dictionary()
    for w in load_words(args, examples):
        word_dict.add(w)
    return word_dict


def index_embedding_chars(char_embedding_file):
    """Put all the chars in char_embedding_file into a set."""
    chars = set()
    with open(char_embedding_file) as f:
        for line in f:
            c = Dictionary.normalize(line.rstrip().split(' ')[0])
            chars.add(c)
    return chars


def load_chars(args, examples):
    """Iterate and index all the chars in examples (documents + questions)."""
    def _insert(iterable):
        for cs in iterable:
            for c in cs:
                c = Dictionary.normalize(c)
                if valid_chars and c not in valid_chars:
                    continue
                chars.add(c)

    if args.restrict_vocab and args.char_embedding_file:
        logger.info('Restricting to chars in %s' % args.char_embedding_file)
        valid_chars = index_embedding_chars(args.char_embedding_file)
        logger.info('Num chars in set = %d' % len(valid_chars))
    else:
        valid_chars = None

    chars = set()
    for ex in examples:
        _insert(ex['question_char'])
        _insert(ex['document_char'])
    return chars


def build_char_dict(args, examples):
    """Return a char dictionary from question and document chars in
    provided examples.
    """
    char_dict = Dictionary()
    for c in load_chars(args, examples):
        char_dict.add(c)
    return char_dict


def top_question_words(args, examples, word_dict):
    """Count and return the most common question words in provided examples."""
    word_count = Counter()
    for ex in examples:
        for w in ex['question']:
            w = Dictionary.normalize(w)
            if w in word_dict:
                word_count.update([w])
    return word_count.most_common(args.tune_partial)


def build_feature_dict(args, examples):
    """Index features (one hot) from fields in examples and options."""
    def _insert(feature):
        if feature not in feature_dict:
            feature_dict[feature] = len(feature_dict)

    feature_dict = {}

    # Exact match features
    if args.use_exact_match:
        _insert('in_cased')
        _insert('in_uncased')
        if args.use_lemma:
            _insert('in_lemma')

    # Part of speech tag features
    if args.use_pos:
        for ex in examples:
            for w in ex['cpos']:
                _insert('pos=%s' % w)
            for w in ex['qpos']:
                _insert('pos=%s' % w)

    # Named entity tag features
    if args.use_ner:
        for ex in examples:
            for w in ex['cner']:
                _insert('ner=%s' % w)
            for w in ex['qner']:
                _insert('ner=%s' % w)

    # Term frequency feature
    if args.use_tf:
        _insert('tf')
    return feature_dict


# ------------------------------------------------------------------------------
# Evaluation. Follows official evaluation script for v1.1 of the SQuAD dataset.
# ------------------------------------------------------------------------------


def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1_score(prediction, ground_truth):
    """Compute the harmonic mean of precision and recall for answer tokens."""
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    """Check if the prediction is a (soft) exact match with the ground truth."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def regex_match_score(prediction, pattern):
    """Check if the prediction matches the given regular expression."""
    try:
        compiled = re.compile(
            pattern,
            flags=re.IGNORECASE + re.UNICODE + re.MULTILINE
        )
    except BaseException:
        # logger.warn is a deprecated alias of logger.warning
        logger.warning('Regular expression failed to compile: %s' % pattern)
        return False
    return compiled.match(prediction) is not None


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    """Given a prediction and multiple valid answers, return the score of
    the best prediction-answer_n pair given a metric function.
    """
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)


# ------------------------------------------------------------------------------
# Utility classes
# ------------------------------------------------------------------------------


class AverageMeter(object):
    """Computes and stores the average and current value."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


class Timer(object):
    """Computes elapsed time."""

    def __init__(self):
        self.running = True
        self.total = 0
        self.start = time.time()

    def reset(self):
        self.running = True
        self.total = 0
        self.start = time.time()
        return self

    def resume(self):
        if not self.running:
            self.running = True
            self.start = time.time()
        return self

    def stop(self):
        if self.running:
            self.running = False
            self.total += time.time() - self.start
        return self

    def time(self):
        if self.running:
            return self.total + time.time() - self.start
        return self.total

================================================
FILE: vector.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Functions for putting examples into torch format."""

from collections import Counter
import torch


def vectorize(ex, model, single_answer=False):
    """Torchify a single example."""
    args = model.args
    word_dict = model.word_dict
    char_dict = model.char_dict
    feature_dict = model.feature_dict

    # Index words
    document = torch.LongTensor([word_dict[w] for w in ex['document']])
    document_char = [torch.LongTensor([char_dict[c] for c in cs])
                     for cs in ex['document_char']]
    question = torch.LongTensor([word_dict[w] for w in ex['question']])
    question_char = [torch.LongTensor([char_dict[c] for c in cs])
                     for cs in ex['question_char']]

    # Create extra features vector
    if len(feature_dict) > 0:
        c_features = torch.zeros(len(ex['document']), len(feature_dict))
        q_features = torch.zeros(len(ex['question']), len(feature_dict))
    else:
        c_features = None
        q_features = None

    # f_{exact_match}
    if args.use_exact_match:
        # Mark document tokens that also appear in the question
        q_words_cased = {w for w in ex['question']}
        q_words_uncased = {w.lower() for w in ex['question']}
        q_lemma = {w for w in ex['qlemma']} if args.use_lemma else None
        for i in range(len(ex['document'])):
            if ex['document'][i] in q_words_cased:
                c_features[i][feature_dict['in_cased']] = 1.0
            if ex['document'][i].lower() in q_words_uncased:
                c_features[i][feature_dict['in_uncased']] = 1.0
            if q_lemma and ex['clemma'][i] in q_lemma:
                c_features[i][feature_dict['in_lemma']] = 1.0

        # Mark question tokens that also appear in the document
        c_words_cased = {w for w in ex['document']}
        c_words_uncased = {w.lower() for w in ex['document']}
        c_lemma = {w for w in ex['clemma']} if args.use_lemma else None
        for i in range(len(ex['question'])):
            if ex['question'][i] in c_words_cased:
                q_features[i][feature_dict['in_cased']] = 1.0
            if ex['question'][i].lower() in c_words_uncased:
                q_features[i][feature_dict['in_uncased']] = 1.0
            if c_lemma and ex['qlemma'][i] in c_lemma:
                q_features[i][feature_dict['in_lemma']] = 1.0

    # f_{token} (POS)
    if args.use_pos:
        for i, w in enumerate(ex['cpos']):
            f = 'pos=%s' % w
            if f in feature_dict:
                c_features[i][feature_dict[f]] = 1.0
        for i, w in enumerate(ex['qpos']):
            f = 'pos=%s' % w
            if f in feature_dict:
                q_features[i][feature_dict[f]] = 1.0

    # f_{token} (NER)
    if args.use_ner:
        for i, w in enumerate(ex['cner']):
            f = 'ner=%s' % w
            if f in feature_dict:
                c_features[i][feature_dict[f]] = 1.0
        for i, w in enumerate(ex['qner']):
            f = 'ner=%s' % w
            if f in feature_dict:
                q_features[i][feature_dict[f]] = 1.0

    # f_{token} (TF)
    if args.use_tf:
        counter = Counter([w.lower() for w in ex['document']])
        l = len(ex['document'])
        for i, w in enumerate(ex['document']):
            c_features[i][feature_dict['tf']] = counter[w.lower()] * 1.0 / l
        counter = Counter([w.lower() for w in ex['question']])
        l = len(ex['question'])
        for i, w in enumerate(ex['question']):
            q_features[i][feature_dict['tf']] = counter[w.lower()] * 1.0 / l

    # Maybe return without target
    if 'answers' not in ex:
        return (document, document_char, c_features,
                question, question_char, q_features, ex['id'])

    # ...or with target(s) (might still be empty if answers is empty)
    if single_answer:
        assert(len(ex['answers']) > 0)
        start = torch.LongTensor(1).fill_(ex['answers'][0][0])
        end = torch.LongTensor(1).fill_(ex['answers'][0][1])
    else:
        start = [a[0] for a in ex['answers']]
        end = [a[1] for a in ex['answers']]

    return (document, document_char, c_features,
            question, question_char, q_features, start, end, ex['id'])


def batchify(batch):
    """Gather a batch of individual examples into one batch."""
    NUM_INPUTS = 6
    NUM_TARGETS = 2
    NUM_EXTRA = 1

    docs = [ex[0] for ex in batch]
    doc_chars = [ex[1] for ex in batch]
    c_features = [ex[2] for ex in batch]
    questions = [ex[3] for ex in batch]
    question_chars = [ex[4] for ex in batch]
    q_features = [ex[5] for ex in batch]
    ids = [ex[-1] for ex in batch]

    # Batch documents and features
    max_length = max([d.size(0) for d in docs])
    # max_char_length = max([c.size(0) for cs in doc_chars for c in cs])
    max_char_length = 13
    x1 = torch.LongTensor(len(docs), max_length).zero_()
    x1_c = torch.LongTensor(len(docs), max_length, max_char_length).zero_()
    # Mask convention: 1 marks padding, 0 marks a real token
    x1_mask = torch.ByteTensor(len(docs), max_length).fill_(1)
    if c_features[0] is None:
        x1_f = None
    else:
        x1_f = torch.zeros(len(docs), max_length, c_features[0].size(1))
    for i, d in enumerate(docs):
        x1[i, :d.size(0)].copy_(d)
        x1_mask[i, :d.size(0)].fill_(0)
        if x1_f is not None:
            x1_f[i, :d.size(0)].copy_(c_features[i])
    for i, cs in enumerate(doc_chars):
        for j, c in enumerate(cs):
            # Truncate tokens longer than max_char_length
            c_ = c[:max_char_length]
            x1_c[i, j, :c_.size(0)].copy_(c_)

    # Batch questions
    max_length = max([q.size(0) for q in questions])
    x2 = torch.LongTensor(len(questions), max_length).zero_()
    x2_c = torch.LongTensor(len(questions), max_length, max_char_length).zero_()
    x2_mask = torch.ByteTensor(len(questions), max_length).fill_(1)
    if q_features[0] is None:
        x2_f = None
    else:
        x2_f = torch.zeros(len(questions), max_length, q_features[0].size(1))
    for i, d in enumerate(questions):
        x2[i, :d.size(0)].copy_(d)
        x2_mask[i, :d.size(0)].fill_(0)
        if x2_f is not None:
            x2_f[i, :d.size(0)].copy_(q_features[i])
    for i, cs in enumerate(question_chars):
        for j, c in enumerate(cs):
            c_ = c[:max_char_length]
            x2_c[i, j, :c_.size(0)].copy_(c_)

    # Maybe return without targets
    if len(batch[0]) == NUM_INPUTS + NUM_EXTRA:
        return x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask, ids

    elif len(batch[0]) == NUM_INPUTS + NUM_EXTRA + NUM_TARGETS:
        # ...Otherwise add targets
        if torch.is_tensor(batch[0][NUM_INPUTS]):
            y_s = torch.cat([ex[NUM_INPUTS] for ex in batch])
            y_e = torch.cat([ex[NUM_INPUTS+1] for ex in batch])
        else:
            y_s = [ex[NUM_INPUTS] for ex in batch]
            y_e = [ex[NUM_INPUTS+1] for ex in batch]
    else:
        raise RuntimeError('Incorrect number of inputs per example.')

    return x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask, y_s, y_e, ids
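
# Illustrative sketch (not part of the repository): the padding and masking
# convention used by batchify(), restated with plain Python lists so it can be
# run without torch. The helper names pad_and_mask and pad_chars are
# hypothetical; the pad value 0, the mask polarity (1 = padding, 0 = real
# token) and the fixed character width 13 are taken from batchify().

```python
def pad_and_mask(seqs, pad_id=0):
    """Pad id lists to the longest length and build a padding mask.

    Padded positions hold pad_id in the data and 1 in the mask;
    real tokens have mask value 0.
    """
    max_len = max(len(s) for s in seqs)
    padded = [list(s) + [pad_id] * (max_len - len(s)) for s in seqs]
    mask = [[0] * len(s) + [1] * (max_len - len(s)) for s in seqs]
    return padded, mask


def pad_chars(token_chars, max_char_length=13, pad_id=0):
    """Truncate or right-pad each token's char ids to a fixed width."""
    out = []
    for cs in token_chars:
        clipped = list(cs)[:max_char_length]
        out.append(clipped + [pad_id] * (max_char_length - len(clipped)))
    return out


if __name__ == '__main__':
    padded, mask = pad_and_mask([[5, 8, 2], [7]])
    print(padded)  # [[5, 8, 2], [7, 0, 0]]
    print(mask)    # [[0, 0, 0], [0, 1, 1]]
```

# With this polarity, a layer that wants to ignore padding can fill the masked
# positions of its scores with -inf before a softmax, which is one common way
# such masks are consumed.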