Repository: HKUST-KnowComp/MnemonicReader
Branch: master
Commit: 76aeb1d9021e
Files: 19
Total size: 144.2 KB
Directory structure:
gitextract_qftjbr90/
├── .gitignore
├── LICENSE
├── README.md
├── config.py
├── data.py
├── layers.py
├── m_reader.py
├── model.py
├── predictor.py
├── r_net.py
├── rnn_reader.py
├── script/
│ ├── evaluate-v1.1.py
│ ├── interactive.py
│ ├── predict.py
│ ├── preprocess.py
│ └── train.py
├── spacy_tokenizer.py
├── utils.py
└── vector.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
*.pyc
*.DS_Store
*~
data/
*.tar.gz
*.egg-info
================================================
FILE: LICENSE
================================================
BSD 3-Clause License
Copyright (c) 2018, HKUST-KnowComp
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: README.md
================================================
# Mnemonic Reader
The Mnemonic Reader is a deep learning model for Machine Comprehension task. You can get details from this [paper](https://arxiv.org/pdf/1705.02798.pdf). It combines advantages of [match-LSTM](https://arxiv.org/pdf/1608.07905), [R-Net](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf) and [Document Reader](https://arxiv.org/abs/1704.00051) and utilizes a new unit, the Semantic Fusion Unit (SFU), to achieve state-of-the-art results (at that time).
This model is a [PyTorch](http://pytorch.org/) implementation of Mnemonic Reader. At the same time, a PyTorch implementation of R-Net and a PyTorch implementation of Document Reader are also included to compare with the Mnemonic Reader. Pretrained models are also available in [release](https://github.com/HKUST-KnowComp/MnemonicReader/releases).
This repo belongs to [HKUST-KnowComp](https://github.com/HKUST-KnowComp) and is under the [BSD LICENSE](LICENSE).
Some codes are implemented based on [DrQA](https://github.com/facebookresearch/DrQA).
Please feel free to contact with Xin Liu (xliucr@connect.ust.hk) if you have any question about this repo.
### Evaluation on SQuAD
| Model | DEV_EM | DEV_F1 |
| ------------------------------------- | ------ | ------ |
| Document Reader (original paper) | 69.5 | 78.8 |
| Document Reader (trained model) | 69.4 | 78.6 |
| R-Net (original paper 1) | 71.1 | 79.5 |
| R-Net (original paper 2) | 72.3 | 80.6 |
| R-Net (trained model) | 70.2 | 79.4 |
| Mnemonic Reader (original paper) | 71.8 | 81.2 |
| Mnemonic Reader + RL (original paper) | 72.1 | 81.6 |
| Mnemonic Reader (trained model) | 73.2 | 81.5 |

### Requirements
* Python >= 3.4
* PyTorch >= 0.31
* spaCy >= 2.0.0
* tqdm
* ujson
* numpy
* prettytable
### Prepare
First of all, you need to download the dataset and pre-trained word vectors.
```bash
mkdir -p data/datasets
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -O data/datasets/SQuAD-train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -O data/datasets/SQuAD-dev-v1.1.json
```
```bash
mkdir -p data/embeddings
wget http://nlp.stanford.edu/data/glove.840B.300d.zip -O data/embeddings/glove.840B.300d.zip
cd data/embeddings
unzip glove.840B.300d.zip
```
Then, you need to preprocess these data.
```bash
python script/preprocess.py data/datasets data/datasets --split SQuAD-train-v1.1
python script/preprocess.py data/datasets data/datasets --split SQuAD-dev-v1.1
```
If you want to use multicores to speed up, you could add `--num-workers 4` in commands.
### Train
There are some parameters to set but default values are ready. If you are not interested in tuning parameters, you can use default values. Just run:
```bash
python script/train.py
```
After several hours, you will get the model in `data/models/`, e.g. `20180416-acc9d06d.mdl` and you can see the log file in `data/models/`, e.g. `20180416-acc9d06d.txt`.
### Predict
To evaluate the model you get, you should complete this part.
```bash
python script/predict.py --model data/models/20180416-acc9d06d.mdl
```
You need to change the model name in the command above.
You will not get results directly but to use the official `evaluate-v1.1.py` in `data/script`.
```bash
python script/evaluate-v1.1.py data/predict/SQuAD-dev-v1.1-20180416-acc9d06d.preds data/datasets/SQuAD-dev-v1.1.json
```
### Interactivate
In order to help those who are interested in QA systems, `script/interactivate.py` provides an easy but good demo.
```bash
python script/interactivate.py --model data/models/20180416-acc9d06d.mdl
```
Then you will drop into an interactive session. It looks like:
```
* Interactive Module *
* Repo: Mnemonic Reader (https://github.com/HKUST-KnowComp/MnemonicReader)
* Implement based on Facebook's DrQA
>>> process(document, question, candidates=None, top_n=1)
>>> usage()
>>> text="Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary."
>>> question = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"
>>> process(text, question)
+------+----------------------------+-----------+
| Rank | Span | Score |
+------+----------------------------+-----------+
| 1 | Saint Bernadette Soubirous | 0.9875301 |
+------+----------------------------+-----------+
```
### More parameters
If you want to tune parameters to achieve a higher score, you can get instructions about parameters via using
```bash
python script/preprocess.py --help
```
```bash
python script/train.py --help
```
```bash
python script/predict.py --help
```
```bash
python script/interactivate.py --help
```
## License
All codes in **Mnemonic Reader** are under [BSD LICENSE](LICENSE).
================================================
FILE: config.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Model architecture/optimization options for WRMCQA document reader."""
import argparse
import logging
logger = logging.getLogger(__name__)
# Index of arguments concerning the core model architecture
MODEL_ARCHITECTURE = {
'model_type', 'embedding_dim', 'char_embedding_dim', 'hidden_size', 'char_hidden_size',
'doc_layers', 'question_layers', 'rnn_type', 'concat_rnn_layers', 'question_merge',
'use_qemb', 'use_exact_match', 'use_pos', 'use_ner', 'use_lemma', 'use_tf', 'hop'
}
# Index of arguments concerning the model optimizer/training
MODEL_OPTIMIZER = {
'fix_embeddings', 'optimizer', 'learning_rate', 'momentum', 'weight_decay',
'rho', 'eps', 'max_len', 'grad_clipping', 'tune_partial',
'rnn_padding', 'dropout_rnn', 'dropout_rnn_output', 'dropout_emb'
}
def str2bool(v):
return v.lower() in ('yes', 'true', 't', '1', 'y')
def add_model_args(parser):
parser.register('type', 'bool', str2bool)
# Model architecture
model = parser.add_argument_group('Reader Model Architecture')
model.add_argument('--model-type', type=str, default='mnemonic',
help='Model architecture type: rnn, r_net, mnemonic')
model.add_argument('--embedding-dim', type=int, default=300,
help='Embedding size if embedding_file is not given')
model.add_argument('--char-embedding-dim', type=int, default=50,
help='Embedding size if char_embedding_file is not given')
model.add_argument('--hidden-size', type=int, default=100,
help='Hidden size of RNN units')
model.add_argument('--char-hidden-size', type=int, default=50,
help='Hidden size of char RNN units')
model.add_argument('--doc-layers', type=int, default=3,
help='Number of encoding layers for document')
model.add_argument('--question-layers', type=int, default=3,
help='Number of encoding layers for question')
model.add_argument('--rnn-type', type=str, default='lstm',
help='RNN type: LSTM, GRU, or RNN')
# Model specific details
detail = parser.add_argument_group('Reader Model Details')
detail.add_argument('--concat-rnn-layers', type='bool', default=True,
help='Combine hidden states from each encoding layer')
detail.add_argument('--question-merge', type=str, default='self_attn',
help='The way of computing the question representation')
detail.add_argument('--use-qemb', type='bool', default=True,
help='Whether to use weighted question embeddings')
detail.add_argument('--use-exact-match', type='bool', default=True,
help='Whether to use in_question_* features')
detail.add_argument('--use-pos', type='bool', default=True,
help='Whether to use pos features')
detail.add_argument('--use-ner', type='bool', default=True,
help='Whether to use ner features')
detail.add_argument('--use-lemma', type='bool', default=True,
help='Whether to use lemma features')
detail.add_argument('--use-tf', type='bool', default=True,
help='Whether to use term frequency features')
detail.add_argument('--hop', type=int, default=2,
help='The number of hops for both aligner and the answer pointer in m-reader')
# Optimization details
optim = parser.add_argument_group('Reader Optimization')
optim.add_argument('--dropout-emb', type=float, default=0.2,
help='Dropout rate for word embeddings')
optim.add_argument('--dropout-rnn', type=float, default=0.2,
help='Dropout rate for RNN states')
optim.add_argument('--dropout-rnn-output', type='bool', default=True,
help='Whether to dropout the RNN output')
optim.add_argument('--optimizer', type=str, default='adamax',
help='Optimizer: sgd, adamax, adadelta')
optim.add_argument('--learning-rate', type=float, default=1.0,
help='Learning rate for sgd, adadelta')
optim.add_argument('--grad-clipping', type=float, default=10,
help='Gradient clipping')
optim.add_argument('--weight-decay', type=float, default=0,
help='Weight decay factor')
optim.add_argument('--momentum', type=float, default=0,
help='Momentum factor')
optim.add_argument('--rho', type=float, default=0.95,
help='Rho for adadelta')
optim.add_argument('--eps', type=float, default=1e-6,
help='Eps for adadelta')
optim.add_argument('--fix-embeddings', type='bool', default=True,
help='Keep word embeddings fixed (use pretrained)')
optim.add_argument('--tune-partial', type=int, default=0,
help='Backprop through only the top N question words')
optim.add_argument('--rnn-padding', type='bool', default=False,
help='Explicitly account for padding in RNN encoding')
optim.add_argument('--max-len', type=int, default=15,
help='The max span allowed during decoding')
def get_model_args(args):
"""Filter args for model ones.
From a args Namespace, return a new Namespace with *only* the args specific
to the model architecture or optimization. (i.e. the ones defined here.)
"""
global MODEL_ARCHITECTURE, MODEL_OPTIMIZER
required_args = MODEL_ARCHITECTURE | MODEL_OPTIMIZER
arg_values = {k: v for k, v in vars(args).items() if k in required_args}
return argparse.Namespace(**arg_values)
def override_model_args(old_args, new_args):
"""Set args to new parameters.
Decide which model args to keep and which to override when resolving a set
of saved args and new args.
We keep the new optimation, but leave the model architecture alone.
"""
global MODEL_OPTIMIZER
old_args, new_args = vars(old_args), vars(new_args)
for k in old_args.keys():
if k in new_args and old_args[k] != new_args[k]:
if k in MODEL_OPTIMIZER:
logger.info('Overriding saved %s: %s --> %s' %
(k, old_args[k], new_args[k]))
old_args[k] = new_args[k]
else:
logger.info('Keeping saved %s: %s' % (k, old_args[k]))
return argparse.Namespace(**old_args)
================================================
FILE: data.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Data processing/loading helpers."""
import numpy as np
import logging
import unicodedata
from torch.utils.data import Dataset
from torch.utils.data.sampler import Sampler
from vector import vectorize
logger = logging.getLogger(__name__)
# ------------------------------------------------------------------------------
# Dictionary class for tokens.
# ------------------------------------------------------------------------------
class Dictionary(object):
NULL = '<NULL>'
UNK = '<UNK>'
START = 2
@staticmethod
def normalize(token):
return unicodedata.normalize('NFD', token)
def __init__(self):
self.tok2ind = {self.NULL: 0, self.UNK: 1}
self.ind2tok = {0: self.NULL, 1: self.UNK}
def __len__(self):
return len(self.tok2ind)
def __iter__(self):
return iter(self.tok2ind)
def __contains__(self, key):
if type(key) == int:
return key in self.ind2tok
elif type(key) == str:
return self.normalize(key) in self.tok2ind
def __getitem__(self, key):
if type(key) == int:
return self.ind2tok.get(key, self.UNK)
if type(key) == str:
return self.tok2ind.get(self.normalize(key),
self.tok2ind.get(self.UNK))
def __setitem__(self, key, item):
if type(key) == int and type(item) == str:
self.ind2tok[key] = item
elif type(key) == str and type(item) == int:
self.tok2ind[key] = item
else:
raise RuntimeError('Invalid (key, item) types.')
def add(self, token):
token = self.normalize(token)
if token not in self.tok2ind:
index = len(self.tok2ind)
self.tok2ind[token] = index
self.ind2tok[index] = token
def tokens(self):
"""Get dictionary tokens.
Return all the words indexed by this dictionary, except for special
tokens.
"""
tokens = [k for k in self.tok2ind.keys()
if k not in {'<NULL>', '<UNK>'}]
return tokens
# ------------------------------------------------------------------------------
# PyTorch dataset class for SQuAD (and SQuAD-like) data.
# ------------------------------------------------------------------------------
class ReaderDataset(Dataset):
def __init__(self, examples, model, single_answer=False):
self.model = model
self.examples = examples
self.single_answer = single_answer
def __len__(self):
return len(self.examples)
def __getitem__(self, index):
return vectorize(self.examples[index], self.model, self.single_answer)
def lengths(self):
return [(len(ex['document']), len(ex['question']))
for ex in self.examples]
# ------------------------------------------------------------------------------
# PyTorch sampler returning batched of sorted lengths (by doc and question).
# ------------------------------------------------------------------------------
class SortedBatchSampler(Sampler):
def __init__(self, lengths, batch_size, shuffle=True):
self.lengths = lengths
self.batch_size = batch_size
self.shuffle = shuffle
def __iter__(self):
lengths = np.array(
[(-l[0], -l[1], np.random.random()) for l in self.lengths],
dtype=[('l1', np.int_), ('l2', np.int_), ('rand', np.float_)]
)
indices = np.argsort(lengths, order=('l1', 'l2', 'rand'))
batches = [indices[i:i + self.batch_size]
for i in range(0, len(indices), self.batch_size)]
if self.shuffle:
np.random.shuffle(batches)
return iter([i for batch in batches for i in batch])
def __len__(self):
return len(self.lengths)
================================================
FILE: layers.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Definitions of model layers/NN modules"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import math
import random
# ------------------------------------------------------------------------------
# Modules
# ------------------------------------------------------------------------------
class StackedBRNN(nn.Module):
"""Stacked Bi-directional RNNs.
Differs from standard PyTorch library in that it has the option to save
and concat the hidden states between layers. (i.e. the output hidden size
for each sequence input is num_layers * hidden_size).
"""
def __init__(self, input_size, hidden_size, num_layers,
dropout_rate=0, dropout_output=False, rnn_type=nn.LSTM,
concat_layers=False, padding=False):
super(StackedBRNN, self).__init__()
self.padding = padding
self.dropout_output = dropout_output
self.dropout_rate = dropout_rate
self.num_layers = num_layers
self.concat_layers = concat_layers
self.rnns = nn.ModuleList()
for i in range(num_layers):
input_size = input_size if i == 0 else 2 * hidden_size
self.rnns.append(rnn_type(input_size, hidden_size,
num_layers=1,
bidirectional=True))
def forward(self, x, x_mask):
"""Encode either padded or non-padded sequences.
Can choose to either handle or ignore variable length sequences.
Always handle padding in eval.
Args:
x: batch * len * hdim
x_mask: batch * len (1 for padding, 0 for true)
Output:
x_encoded: batch * len * hdim_encoded
"""
if x_mask.data.sum() == 0 or x_mask.data.eq(1).long().sum(1).min() == 0:
# No padding necessary.
output = self._forward_unpadded(x, x_mask)
elif self.padding or not self.training:
# Pad if we care or if its during eval.
output = self._forward_padded(x, x_mask)
else:
# We don't care.
output = self._forward_unpadded(x, x_mask)
return output.contiguous()
def _forward_unpadded(self, x, x_mask):
"""Faster encoding that ignores any padding."""
# Transpose batch and sequence dims
x = x.transpose(0, 1)
# Encode all layers
outputs = [x]
for i in range(self.num_layers):
rnn_input = outputs[-1]
# Apply dropout to hidden input
if self.dropout_rate > 0:
rnn_input = F.dropout(rnn_input,
p=self.dropout_rate,
training=self.training)
# Forward
rnn_output = self.rnns[i](rnn_input)[0]
outputs.append(rnn_output)
# Concat hidden layers
if self.concat_layers:
output = torch.cat(outputs[1:], 2)
else:
output = outputs[-1]
# Transpose back
output = output.transpose(0, 1)
# Dropout on output layer
if self.dropout_output and self.dropout_rate > 0:
output = F.dropout(output,
p=self.dropout_rate,
training=self.training)
return output
def _forward_padded(self, x, x_mask):
"""Slower (significantly), but more precise, encoding that handles
padding.
"""
# Compute sorted sequence lengths
lengths = x_mask.data.eq(0).long().sum(1).squeeze()
_, idx_sort = torch.sort(lengths, dim=0, descending=True)
_, idx_unsort = torch.sort(idx_sort, dim=0)
lengths = list(lengths[idx_sort])
idx_sort = Variable(idx_sort)
idx_unsort = Variable(idx_unsort)
# Sort x
x = x.index_select(0, idx_sort)
# Transpose batch and sequence dims
x = x.transpose(0, 1)
# Pack it up
rnn_input = nn.utils.rnn.pack_padded_sequence(x, lengths)
# Encode all layers
outputs = [rnn_input]
for i in range(self.num_layers):
rnn_input = outputs[-1]
# Apply dropout to input
if self.dropout_rate > 0:
dropout_input = F.dropout(rnn_input.data,
p=self.dropout_rate,
training=self.training)
rnn_input = nn.utils.rnn.PackedSequence(dropout_input,
rnn_input.batch_sizes)
outputs.append(self.rnns[i](rnn_input)[0])
# Unpack everything
for i, o in enumerate(outputs[1:], 1):
outputs[i] = nn.utils.rnn.pad_packed_sequence(o)[0]
# Concat hidden layers or take final
if self.concat_layers:
output = torch.cat(outputs[1:], 2)
else:
output = outputs[-1]
# Transpose and unsort
output = output.transpose(0, 1)
output = output.index_select(0, idx_unsort)
# Pad up to original batch sequence length
if output.size(1) != x_mask.size(1):
padding = torch.zeros(output.size(0),
x_mask.size(1) - output.size(1),
output.size(2)).type(output.data.type())
output = torch.cat([output, Variable(padding)], 1)
# Dropout on output layer
if self.dropout_output and self.dropout_rate > 0:
output = F.dropout(output,
p=self.dropout_rate,
training=self.training)
return output
class FeedForwardNetwork(nn.Module):
def __init__(self, input_size, hidden_size, output_size, dropout_rate=0):
super(FeedForwardNetwork, self).__init__()
self.dropout_rate = dropout_rate
self.linear1 = nn.Linear(input_size, hidden_size)
self.linear2 = nn.Linear(hidden_size, output_size)
def forward(self, x):
x_proj = F.dropout(F.relu(self.linear1(x)), p=self.dropout_rate, training=self.training)
x_proj = self.linear2(x_proj)
return x_proj
class PointerNetwork(nn.Module):
def __init__(self, x_size, y_size, hidden_size, dropout_rate=0, cell_type=nn.GRUCell, normalize=True):
super(PointerNetwork, self).__init__()
self.normalize = normalize
self.hidden_size = hidden_size
self.dropout_rate = dropout_rate
self.linear = nn.Linear(x_size+y_size, hidden_size, bias=False)
self.weights = nn.Linear(hidden_size, 1, bias=False)
self.self_attn = NonLinearSeqAttn(y_size, hidden_size)
self.cell = cell_type(x_size, y_size)
def init_hiddens(self, y, y_mask):
attn = self.self_attn(y, y_mask)
res = attn.unsqueeze(1).bmm(y).squeeze(1) # [B, I]
return res
def pointer(self, x, state, x_mask):
x_ = torch.cat([x, state.unsqueeze(1).repeat(1,x.size(1),1)], 2)
s0 = F.tanh(self.linear(x_))
s = self.weights(s0).view(x.size(0), x.size(1))
s.data.masked_fill_(x_mask.data, -float('inf'))
a = F.softmax(s)
res = a.unsqueeze(1).bmm(x).squeeze(1)
if self.normalize:
if self.training:
# In training we output log-softmax for NLL
scores = F.log_softmax(s)
else:
# ...Otherwise 0-1 probabilities
scores = F.softmax(s)
else:
scores = a.exp()
return res, scores
def forward(self, x, y, x_mask, y_mask):
hiddens = self.init_hiddens(y, y_mask)
c, start_scores = self.pointer(x, hiddens, x_mask)
c_ = F.dropout(c, p=self.dropout_rate, training=self.training)
hiddens = self.cell(c_, hiddens)
c, end_scores = self.pointer(x, hiddens, x_mask)
return start_scores, end_scores
class MemoryAnsPointer(nn.Module):
def __init__(self, x_size, y_size, hidden_size, hop=1, dropout_rate=0, normalize=True):
super(MemoryAnsPointer, self).__init__()
self.normalize = normalize
self.hidden_size = hidden_size
self.hop = hop
self.dropout_rate = dropout_rate
self.FFNs_start = nn.ModuleList()
self.SFUs_start = nn.ModuleList()
self.FFNs_end = nn.ModuleList()
self.SFUs_end = nn.ModuleList()
for i in range(self.hop):
self.FFNs_start.append(FeedForwardNetwork(x_size+y_size+2*hidden_size, hidden_size, 1, dropout_rate))
self.SFUs_start.append(SFU(y_size, 2*hidden_size))
self.FFNs_end.append(FeedForwardNetwork(x_size+y_size+2*hidden_size, hidden_size, 1, dropout_rate))
self.SFUs_end.append(SFU(y_size, 2*hidden_size))
def forward(self, x, y, x_mask, y_mask):
z_s = y[:,-1,:].unsqueeze(1) # [B, 1, I]
z_e = None
s = None
e = None
p_s = None
p_e = None
for i in range(self.hop):
z_s_ = z_s.repeat(1,x.size(1),1) # [B, S, I]
s = self.FFNs_start[i](torch.cat([x, z_s_, x*z_s_], 2)).squeeze(2)
s.data.masked_fill_(x_mask.data, -float('inf'))
p_s = F.softmax(s, dim=1) # [B, S]
u_s = p_s.unsqueeze(1).bmm(x) # [B, 1, I]
z_e = self.SFUs_start[i](z_s, u_s) # [B, 1, I]
z_e_ = z_e.repeat(1,x.size(1),1) # [B, S, I]
e = self.FFNs_end[i](torch.cat([x, z_e_, x*z_e_], 2)).squeeze(2)
e.data.masked_fill_(x_mask.data, -float('inf'))
p_e = F.softmax(e, dim=1) # [B, S]
u_e = p_e.unsqueeze(1).bmm(x) # [B, 1, I]
z_s = self.SFUs_end[i](z_e, u_e)
if self.normalize:
if self.training:
# In training we output log-softmax for NLL
p_s = F.log_softmax(s, dim=1) # [B, S]
p_e = F.log_softmax(e, dim=1) # [B, S]
else:
# ...Otherwise 0-1 probabilities
p_s = F.softmax(s, dim=1) # [B, S]
p_e = F.softmax(e, dim=1) # [B, S]
else:
p_s = s.exp()
p_e = e.exp()
return p_s, p_e
# ------------------------------------------------------------------------------
# Attentions
# ------------------------------------------------------------------------------
class SeqAttnMatch(nn.Module):
"""Given sequences X and Y, match sequence Y to each element in X.
* o_i = sum(alpha_j * y_j) for i in X
* alpha_j = softmax(y_j * x_i)
"""
def __init__(self, input_size, identity=False):
super(SeqAttnMatch, self).__init__()
if not identity:
self.linear = nn.Linear(input_size, input_size)
else:
self.linear = None
def forward(self, x, y, y_mask):
"""
Args:
x: batch * len1 * hdim
y: batch * len2 * hdim
y_mask: batch * len2 (1 for padding, 0 for true)
Output:
matched_seq: batch * len1 * hdim
"""
# Project vectors
if self.linear:
x_proj = self.linear(x.view(-1, x.size(2))).view(x.size())
x_proj = F.relu(x_proj)
y_proj = self.linear(y.view(-1, y.size(2))).view(y.size())
y_proj = F.relu(y_proj)
else:
x_proj = x
y_proj = y
# Compute scores
scores = x_proj.bmm(y_proj.transpose(2, 1))
# Mask padding
y_mask = y_mask.unsqueeze(1).expand(scores.size())
scores.data.masked_fill_(y_mask.data, -float('inf'))
# Normalize with softmax
alpha = F.softmax(scores, dim=2)
# Take weighted average
matched_seq = alpha.bmm(y)
return matched_seq
class SelfAttnMatch(nn.Module):
"""Given sequences X and Y, match sequence Y to each element in X.
* o_i = sum(alpha_j * x_j) for i in X
* alpha_j = softmax(x_j * x_i)
"""
def __init__(self, input_size, identity=False, diag=True):
super(SelfAttnMatch, self).__init__()
if not identity:
self.linear = nn.Linear(input_size, input_size)
else:
self.linear = None
self.diag = diag
def forward(self, x, x_mask):
"""
Args:
x: batch * len1 * dim1
x_mask: batch * len1 (1 for padding, 0 for true)
Output:
matched_seq: batch * len1 * dim1
"""
# Project vectors
if self.linear:
x_proj = self.linear(x.view(-1, x.size(2))).view(x.size())
x_proj = F.relu(x_proj)
else:
x_proj = x
# Compute scores
scores = x_proj.bmm(x_proj.transpose(2, 1))
if not self.diag:
x_len = x.size(1)
for i in range(x_len):
scores[:, i, i] = 0
# Mask padding
x_mask = x_mask.unsqueeze(1).expand(scores.size())
scores.data.masked_fill_(x_mask.data, -float('inf'))
# Normalize with softmax
alpha = F.softmax(scores, dim=2)
# Take weighted average
matched_seq = alpha.bmm(x)
return matched_seq
class BilinearSeqAttn(nn.Module):
"""A bilinear attention layer over a sequence X w.r.t y:
* o_i = softmax(x_i'Wy) for x_i in X.
Optionally don't normalize output weights.
"""
def __init__(self, x_size, y_size, identity=False, normalize=True):
super(BilinearSeqAttn, self).__init__()
self.normalize = normalize
# If identity is true, we just use a dot product without transformation.
if not identity:
self.linear = nn.Linear(y_size, x_size)
else:
self.linear = None
def forward(self, x, y, x_mask):
"""
Args:
x: batch * len * hdim1
y: batch * hdim2
x_mask: batch * len (1 for padding, 0 for true)
Output:
alpha = batch * len
"""
Wy = self.linear(y) if self.linear is not None else y
xWy = x.bmm(Wy.unsqueeze(2)).squeeze(2)
xWy.data.masked_fill_(x_mask.data, -float('inf'))
if self.normalize:
if self.training:
# In training we output log-softmax for NLL
alpha = F.log_softmax(xWy)
else:
# ...Otherwise 0-1 probabilities
alpha = F.softmax(xWy)
else:
alpha = xWy.exp()
return alpha
class LinearSeqAttn(nn.Module):
"""Self attention over a sequence:
* o_i = softmax(Wx_i) for x_i in X.
"""
def __init__(self, input_size):
super(LinearSeqAttn, self).__init__()
self.linear = nn.Linear(input_size, 1)
def forward(self, x, x_mask):
"""
Args:
x: batch * len * hdim
x_mask: batch * len (1 for padding, 0 for true)
Output:
alpha: batch * len
"""
x_flat = x.view(-1, x.size(-1))
scores = self.linear(x_flat).view(x.size(0), x.size(1))
scores.data.masked_fill_(x_mask.data, -float('inf'))
alpha = F.softmax(scores)
return alpha
class NonLinearSeqAttn(nn.Module):
"""Self attention over a sequence:
* o_i = softmax(function(Wx_i)) for x_i in X.
"""
def __init__(self, input_size, hidden_size):
super(NonLinearSeqAttn, self).__init__()
self.FFN = FeedForwardNetwork(input_size, hidden_size, 1)
def forward(self, x, x_mask):
"""
Args:
x: batch * len * dim
x_mask: batch * len (1 for padding, 0 for true)
Output:
alpha: batch * len
"""
scores = self.FFN(x).squeeze(2)
scores.data.masked_fill_(x_mask.data, -float('inf'))
alpha = F.softmax(scores)
return alpha
# ------------------------------------------------------------------------------
# Functional Units
# ------------------------------------------------------------------------------
class Gate(nn.Module):
"""Gate Unit
g = sigmoid(Wx)
x = g * x
"""
def __init__(self, input_size):
super(Gate, self).__init__()
self.linear = nn.Linear(input_size, input_size, bias=False)
def forward(self, x):
"""
Args:
x: batch * len * dim
x_mask: batch * len (1 for padding, 0 for true)
Output:
res: batch * len * dim
"""
x_proj = self.linear(x)
gate = F.sigmoid(x)
return x_proj * gate
class SFU(nn.Module):
"""Semantic Fusion Unit
The ouput vector is expected to not only retrieve correlative information from fusion vectors,
but also retain partly unchange as the input vector
"""
def __init__(self, input_size, fusion_size):
super(SFU, self).__init__()
self.linear_r = nn.Linear(input_size + fusion_size, input_size)
self.linear_g = nn.Linear(input_size + fusion_size, input_size)
def forward(self, x, fusions):
r_f = torch.cat([x, fusions], 2)
r = F.tanh(self.linear_r(r_f))
g = F.sigmoid(self.linear_g(r_f))
o = g * r + (1-g) * x
return o
# ------------------------------------------------------------------------------
# Functional
# ------------------------------------------------------------------------------
def uniform_weights(x, x_mask):
"""Return uniform weights over non-masked x (a sequence of vectors).
Args:
x: batch * len * hdim
x_mask: batch * len (1 for padding, 0 for true)
Output:
x_avg: batch * hdim
"""
alpha = Variable(torch.ones(x.size(0), x.size(1)))
if x.data.is_cuda:
alpha = alpha.cuda()
alpha = alpha * x_mask.eq(0).float()
alpha = alpha / alpha.sum(1).expand(alpha.size())
return alpha
def weighted_avg(x, weights):
"""Return a weighted average of x (a sequence of vectors).
Args:
x: batch * len * hdim
weights: batch * len, sum(dim = 1) = 1
Output:
x_avg: batch * hdim
"""
return weights.unsqueeze(1).bmm(x).squeeze(1)
================================================
FILE: m_reader.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Implementation of the Mnemonic Reader."""
import torch
import torch.nn as nn
import torch.nn.functional as F
import layers
from torch.autograd import Variable
# ------------------------------------------------------------------------------
# Network
# ------------------------------------------------------------------------------
class MnemonicReader(nn.Module):
RNN_TYPES = {'lstm': nn.LSTM, 'gru': nn.GRU, 'rnn': nn.RNN}
CELL_TYPES = {'lstm': nn.LSTMCell, 'gru': nn.GRUCell, 'rnn': nn.RNNCell}
def __init__(self, args, normalize=True):
super(MnemonicReader, self).__init__()
# Store config
self.args = args
# Word embeddings (+1 for padding)
self.embedding = nn.Embedding(args.vocab_size,
args.embedding_dim,
padding_idx=0)
# Char embeddings (+1 for padding)
self.char_embedding = nn.Embedding(args.char_size,
args.char_embedding_dim,
padding_idx=0)
# Char rnn to generate char features
self.char_rnn = layers.StackedBRNN(
input_size=args.char_embedding_dim,
hidden_size=args.char_hidden_size,
num_layers=1,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=False,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=False,
)
doc_input_size = args.embedding_dim + args.char_hidden_size * 2 + args.num_features
# Encoder
self.encoding_rnn = layers.StackedBRNN(
input_size=doc_input_size,
hidden_size=args.hidden_size,
num_layers=1,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=False,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=args.rnn_padding,
)
doc_hidden_size = 2 * args.hidden_size
# Interactive aligning, self aligning and aggregating
self.interactive_aligners = nn.ModuleList()
self.interactive_SFUs = nn.ModuleList()
self.self_aligners = nn.ModuleList()
self.self_SFUs = nn.ModuleList()
self.aggregate_rnns = nn.ModuleList()
for i in range(args.hop):
# interactive aligner
self.interactive_aligners.append(layers.SeqAttnMatch(doc_hidden_size, identity=True))
self.interactive_SFUs.append(layers.SFU(doc_hidden_size, 3 * doc_hidden_size))
# self aligner
self.self_aligners.append(layers.SelfAttnMatch(doc_hidden_size, identity=True, diag=False))
self.self_SFUs.append(layers.SFU(doc_hidden_size, 3 * doc_hidden_size))
# aggregating
self.aggregate_rnns.append(
layers.StackedBRNN(
input_size=doc_hidden_size,
hidden_size=args.hidden_size,
num_layers=1,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=False,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=args.rnn_padding,
)
)
# Memmory-based Answer Pointer
self.mem_ans_ptr = layers.MemoryAnsPointer(
x_size=2*args.hidden_size,
y_size=2*args.hidden_size,
hidden_size=args.hidden_size,
hop=args.hop,
dropout_rate=args.dropout_rnn,
normalize=normalize
)
def forward(self, x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask):
"""Inputs:
x1 = document word indices [batch * len_d]
x1_c = document char indices [batch * len_d]
x1_f = document word features indices [batch * len_d * nfeat]
x1_mask = document padding mask [batch * len_d]
x2 = question word indices [batch * len_q]
x2_c = document char indices [batch * len_d]
x1_f = document word features indices [batch * len_d * nfeat]
x2_mask = question padding mask [batch * len_q]
"""
# Embed both document and question
x1_emb = self.embedding(x1)
x2_emb = self.embedding(x2)
x1_c_emb = self.char_embedding(x1_c)
x2_c_emb = self.char_embedding(x2_c)
# Dropout on embeddings
if self.args.dropout_emb > 0:
x1_emb = F.dropout(x1_emb, p=self.args.dropout_emb, training=self.training)
x2_emb = F.dropout(x2_emb, p=self.args.dropout_emb, training=self.training)
x1_c_emb = F.dropout(x1_c_emb, p=self.args.dropout_emb, training=self.training)
x2_c_emb = F.dropout(x2_c_emb, p=self.args.dropout_emb, training=self.training)
# Generate char features
x1_c_features = self.char_rnn(
x1_c_emb.reshape((x1_c_emb.size(0) * x1_c_emb.size(1), x1_c_emb.size(2), x1_c_emb.size(3))),
x1_mask.unsqueeze(2).repeat(1, 1, x1_c_emb.size(2)).reshape((x1_c_emb.size(0) * x1_c_emb.size(1), x1_c_emb.size(2)))
).reshape((x1_c_emb.size(0), x1_c_emb.size(1), x1_c_emb.size(2), -1))[:,:,-1,:]
x2_c_features = self.char_rnn(
x2_c_emb.reshape((x2_c_emb.size(0) * x2_c_emb.size(1), x2_c_emb.size(2), x2_c_emb.size(3))),
x2_mask.unsqueeze(2).repeat(1, 1, x2_c_emb.size(2)).reshape((x2_c_emb.size(0) * x2_c_emb.size(1), x2_c_emb.size(2)))
).reshape((x2_c_emb.size(0), x2_c_emb.size(1), x2_c_emb.size(2), -1))[:,:,-1,:]
# Combine input
crnn_input = [x1_emb, x1_c_features]
qrnn_input = [x2_emb, x2_c_features]
# Add manual features
if self.args.num_features > 0:
crnn_input.append(x1_f)
qrnn_input.append(x2_f)
# Encode document with RNN
c = self.encoding_rnn(torch.cat(crnn_input, 2), x1_mask)
# Encode question with RNN
q = self.encoding_rnn(torch.cat(qrnn_input, 2), x2_mask)
# Align and aggregate
c_check = c
for i in range(self.args.hop):
q_tilde = self.interactive_aligners[i].forward(c_check, q, x2_mask)
c_bar = self.interactive_SFUs[i].forward(c_check, torch.cat([q_tilde, c_check * q_tilde, c_check - q_tilde], 2))
c_tilde = self.self_aligners[i].forward(c_bar, x1_mask)
c_hat = self.self_SFUs[i].forward(c_bar, torch.cat([c_tilde, c_bar * c_tilde, c_bar - c_tilde], 2))
c_check = self.aggregate_rnns[i].forward(c_hat, x1_mask)
# Predict
start_scores, end_scores = self.mem_ans_ptr.forward(c_check, q, x1_mask, x2_mask)
return start_scores, end_scores
================================================
FILE: model.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Document Reader model"""
import torch
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import logging
import copy
from torch.autograd import Variable
from config import override_model_args
from r_net import R_Net
from rnn_reader import RnnDocReader
from m_reader import MnemonicReader
from data import Dictionary
logger = logging.getLogger(__name__)
class DocReader(object):
"""High level model that handles intializing the underlying network
architecture, saving, updating examples, and predicting examples.
"""
# --------------------------------------------------------------------------
# Initialization
# --------------------------------------------------------------------------
def __init__(self, args, word_dict, char_dict, feature_dict,
state_dict=None, normalize=True):
# Book-keeping.
self.args = args
self.word_dict = word_dict
self.char_dict = char_dict
self.args.vocab_size = len(word_dict)
self.args.char_size = len(char_dict)
self.feature_dict = feature_dict
self.args.num_features = len(feature_dict)
self.updates = 0
self.use_cuda = False
self.parallel = False
# Building network. If normalize if false, scores are not normalized
# 0-1 per paragraph (no softmax).
if args.model_type == 'rnn':
self.network = RnnDocReader(args, normalize)
elif args.model_type == 'r_net':
self.network = R_Net(args, normalize)
elif args.model_type == 'mnemonic':
self.network = MnemonicReader(args, normalize)
else:
raise RuntimeError('Unsupported model: %s' % args.model_type)
# Load saved state
if state_dict:
# Load buffer separately
if 'fixed_embedding' in state_dict:
fixed_embedding = state_dict.pop('fixed_embedding')
self.network.load_state_dict(state_dict)
self.network.register_buffer('fixed_embedding', fixed_embedding)
else:
self.network.load_state_dict(state_dict)
def expand_dictionary(self, words):
"""Add words to the DocReader dictionary if they do not exist. The
underlying embedding matrix is also expanded (with random embeddings).
Args:
words: iterable of tokens to add to the dictionary.
Output:
added: set of tokens that were added.
"""
to_add = {self.word_dict.normalize(w) for w in words
if w not in self.word_dict}
# Add words to dictionary and expand embedding layer
if len(to_add) > 0:
logger.info('Adding %d new words to dictionary...' % len(to_add))
for w in to_add:
self.word_dict.add(w)
self.args.vocab_size = len(self.word_dict)
logger.info('New vocab size: %d' % len(self.word_dict))
old_embedding = self.network.embedding.weight.data
self.network.embedding = torch.nn.Embedding(self.args.vocab_size,
self.args.embedding_dim,
padding_idx=0)
new_embedding = self.network.embedding.weight.data
new_embedding[:old_embedding.size(0)] = old_embedding
# Return added words
return to_add
def expand_char_dictionary(self, chars):
"""Add chars to the DocReader dictionary if they do not exist. The
underlying embedding matrix is also expanded (with random embeddings).
Args:
chars: iterable of tokens to add to the dictionary.
Output:
added: set of tokens that were added.
"""
to_add = {self.char_dict.normalize(w) for w in chars
if w not in self.char_dict}
# Add chars to dictionary and expand embedding layer
if len(to_add) > 0:
logger.info('Adding %d new chars to dictionary...' % len(to_add))
for w in to_add:
self.char_dict.add(w)
self.args.char_size = len(self.char_dict)
logger.info('New char size: %d' % len(self.char_dict))
old_char_embedding = self.network.char_embedding.weight.data
self.network.char_embedding = torch.nn.Embedding(self.args.char_size,
self.args.char_embedding_dim,
padding_idx=0)
new_char_embedding = self.network.char_embedding.weight.data
new_char_embedding[:old_char_embedding.size(0)] = old_char_embedding
# Return added chars
return to_add
def load_embeddings(self, words, embedding_file):
"""Load pretrained embeddings for a given list of words, if they exist.
Args:
words: iterable of tokens. Only those that are indexed in the
dictionary are kept.
embedding_file: path to text file of embeddings, space separated.
"""
words = {w for w in words if w in self.word_dict}
logger.info('Loading pre-trained embeddings for %d words from %s' %
(len(words), embedding_file))
embedding = self.network.embedding.weight.data
# When normalized, some words are duplicated. (Average the embeddings).
vec_counts = {}
with open(embedding_file) as f:
for line in f:
parsed = line.rstrip().split(' ')
assert(len(parsed) == embedding.size(1) + 1)
w = self.word_dict.normalize(parsed[0])
if w in words:
vec = torch.Tensor([float(i) for i in parsed[1:]])
if w not in vec_counts:
vec_counts[w] = 1
embedding[self.word_dict[w]].copy_(vec)
else:
logging.warning(
'WARN: Duplicate embedding found for %s' % w
)
vec_counts[w] = vec_counts[w] + 1
embedding[self.word_dict[w]].add_(vec)
for w, c in vec_counts.items():
embedding[self.word_dict[w]].div_(c)
logger.info('Loaded %d embeddings (%.2f%%)' %
(len(vec_counts), 100 * len(vec_counts) / len(words)))
def load_char_embeddings(self, chars, char_embedding_file):
"""Load pretrained embeddings for a given list of chars, if they exist.
Args:
chars: iterable of tokens. Only those that are indexed in the
dictionary are kept.
char_embedding_file: path to text file of embeddings, space separated.
"""
chars = {w for w in chars if w in self.char_dict}
logger.info('Loading pre-trained embeddings for %d chars from %s' %
(len(chars), char_embedding_file))
char_embedding = self.network.char_embedding.weight.data
# When normalized, some chars are duplicated. (Average the embeddings).
vec_counts = {}
with open(char_embedding_file) as f:
for line in f:
parsed = line.rstrip().split(' ')
assert(len(parsed) == char_embedding.size(1) + 1)
w = self.char_dict.normalize(parsed[0])
if w in chars:
vec = torch.Tensor([float(i) for i in parsed[1:]])
if w not in vec_counts:
vec_counts[w] = 1
char_embedding[self.char_dict[w]].copy_(vec)
else:
logging.warning(
'WARN: Duplicate char embedding found for %s' % w
)
vec_counts[w] = vec_counts[w] + 1
char_embedding[self.char_dict[w]].add_(vec)
for w, c in vec_counts.items():
char_embedding[self.char_dict[w]].div_(c)
logger.info('Loaded %d char embeddings (%.2f%%)' %
(len(vec_counts), 100 * len(vec_counts) / len(chars)))
def tune_embeddings(self, words):
"""Unfix the embeddings of a list of words. This is only relevant if
only some of the embeddings are being tuned (tune_partial = N).
Shuffles the N specified words to the front of the dictionary, and saves
the original vectors of the other N + 1:vocab words in a fixed buffer.
Args:
words: iterable of tokens contained in dictionary.
"""
words = {w for w in words if w in self.word_dict}
if len(words) == 0:
logger.warning('Tried to tune embeddings, but no words given!')
return
if len(words) == len(self.word_dict):
logger.warning('Tuning ALL embeddings in dictionary')
return
# Shuffle words and vectors
embedding = self.network.embedding.weight.data
for idx, swap_word in enumerate(words, self.word_dict.START):
# Get current word + embedding for this index
curr_word = self.word_dict[idx]
curr_emb = embedding[idx].clone()
old_idx = self.word_dict[swap_word]
# Swap embeddings + dictionary indices
embedding[idx].copy_(embedding[old_idx])
embedding[old_idx].copy_(curr_emb)
self.word_dict[swap_word] = idx
self.word_dict[idx] = swap_word
self.word_dict[curr_word] = old_idx
self.word_dict[old_idx] = curr_word
# Save the original, fixed embeddings
self.network.register_buffer(
'fixed_embedding', embedding[idx + 1:].clone()
)
def init_optimizer(self, state_dict=None):
"""Initialize an optimizer for the free parameters of the network.
Args:
state_dict: network parameters
"""
if self.args.fix_embeddings:
for p in self.network.embedding.parameters():
p.requires_grad = False
parameters = [p for p in self.network.parameters() if p.requires_grad]
if self.args.optimizer == 'sgd':
self.optimizer = optim.SGD(parameters, lr=self.args.learning_rate,
momentum=self.args.momentum,
weight_decay=self.args.weight_decay)
elif self.args.optimizer == 'adamax':
self.optimizer = optim.Adamax(parameters,
weight_decay=self.args.weight_decay)
elif self.args.optimizer == 'adadelta':
self.optimizer = optim.Adadelta(parameters, lr=self.args.learning_rate,
rho=self.args.rho, eps=self.args.eps,
weight_decay=self.args.weight_decay)
else:
raise RuntimeError('Unsupported optimizer: %s' %
self.args.optimizer)
# --------------------------------------------------------------------------
# Learning
# --------------------------------------------------------------------------
def update(self, ex):
"""Forward a batch of examples; step the optimizer to update weights."""
if not self.optimizer:
raise RuntimeError('No optimizer set.')
# Train mode
self.network.train()
# Transfer to GPU
if self.use_cuda:
inputs = [e if e is None else Variable(e.cuda(async=True)) for e in ex[:-3]]
target_s = Variable(ex[-3].cuda(async=True))
target_e = Variable(ex[-2].cuda(async=True))
else:
inputs = [e if e is None else Variable(e) for e in ex[:-3]]
target_s = Variable(ex[-3])
target_e = Variable(ex[-2])
# Run forward
score_s, score_e = self.network(*inputs)
# Compute loss and accuracies
loss = F.nll_loss(score_s, target_s) + F.nll_loss(score_e, target_e)
# Clear gradients and run backward
self.optimizer.zero_grad()
loss.backward()
# Clip gradients
torch.nn.utils.clip_grad_norm(self.network.parameters(),
self.args.grad_clipping)
# Update parameters
self.optimizer.step()
self.updates += 1
# Reset any partially fixed parameters (e.g. rare words)
self.reset_parameters()
return loss.data[0], ex[0].size(0)
def reset_parameters(self):
"""Reset any partially fixed parameters to original states."""
# Reset fixed embeddings to original value
if self.args.tune_partial > 0:
# Embeddings to fix are indexed after the special + N tuned words
offset = self.args.tune_partial + self.word_dict.START
if self.parallel:
embedding = self.network.module.embedding.weight.data
fixed_embedding = self.network.module.fixed_embedding
else:
embedding = self.network.embedding.weight.data
fixed_embedding = self.network.fixed_embedding
if offset < embedding.size(0):
embedding[offset:] = fixed_embedding
# --------------------------------------------------------------------------
# Prediction
# --------------------------------------------------------------------------
def predict(self, ex, candidates=None, top_n=1, async_pool=None):
"""Forward a batch of examples only to get predictions.
Args:
ex: the batch
candidates: batch * variable length list of string answer options.
The model will only consider exact spans contained in this list.
top_n: Number of predictions to return per batch element.
async_pool: If provided, non-gpu post-processing will be offloaded
to this CPU process pool.
Output:
pred_s: batch * top_n predicted start indices
pred_e: batch * top_n predicted end indices
pred_score: batch * top_n prediction scores
If async_pool is given, these will be AsyncResult handles.
"""
# Eval mode
self.network.eval()
# Transfer to GPU
if self.use_cuda:
inputs = [e if e is None else
Variable(e.cuda(async=True), volatile=True)
for e in ex[:8]]
else:
inputs = [e if e is None else Variable(e, volatile=True)
for e in ex[:8]]
# Run forward
score_s, score_e = self.network(*inputs)
del inputs
# Decode predictions
score_s = score_s.data.cpu()
score_e = score_e.data.cpu()
if candidates:
args = (score_s, score_e, candidates, top_n, self.args.max_len)
if async_pool:
return async_pool.apply_async(self.decode_candidates, args)
else:
return self.decode_candidates(*args)
else:
args = (score_s, score_e, top_n, self.args.max_len)
if async_pool:
return async_pool.apply_async(self.decode, args)
else:
return self.decode(*args)
@staticmethod
def decode(score_s, score_e, top_n=1, max_len=None):
"""Take argmax of constrained score_s * score_e.
Args:
score_s: independent start predictions
score_e: independent end predictions
top_n: number of top scored pairs to take
max_len: max span length to consider
"""
pred_s = []
pred_e = []
pred_score = []
max_len = max_len or score_s.size(1)
for i in range(score_s.size(0)):
# Outer product of scores to get full p_s * p_e matrix
scores = torch.ger(score_s[i], score_e[i])
# Zero out negative length and over-length span scores
scores.triu_().tril_(max_len - 1)
# Take argmax or top n
scores = scores.numpy()
scores_flat = scores.flatten()
if top_n == 1:
idx_sort = [np.argmax(scores_flat)]
elif len(scores_flat) < top_n:
idx_sort = np.argsort(-scores_flat)
else:
idx = np.argpartition(-scores_flat, top_n)[0:top_n]
idx_sort = idx[np.argsort(-scores_flat[idx])]
s_idx, e_idx = np.unravel_index(idx_sort, scores.shape)
pred_s.append(s_idx)
pred_e.append(e_idx)
pred_score.append(scores_flat[idx_sort])
del score_s, score_e
return pred_s, pred_e, pred_score
@staticmethod
def decode_candidates(score_s, score_e, candidates, top_n=1, max_len=None):
"""Take argmax of constrained score_s * score_e. Except only consider
spans that are in the candidates list.
"""
pred_s = []
pred_e = []
pred_score = []
for i in range(score_s.size(0)):
# Extract original tokens stored with candidates
tokens = candidates[i]['input']
cands = candidates[i]['cands']
if not cands:
# try getting from globals? (multiprocessing in pipeline mode)
from ..pipeline.wrmcqa import PROCESS_CANDS
cands = PROCESS_CANDS
if not cands:
raise RuntimeError('No candidates given.')
# Score all valid candidates found in text.
# Brute force get all ngrams and compare against the candidate list.
max_len = max_len or len(tokens)
scores, s_idx, e_idx = [], [], []
for s, e in tokens.ngrams(n=max_len, as_strings=False):
span = tokens.slice(s, e).untokenize()
if span in cands or span.lower() in cands:
# Match! Record its score.
scores.append(score_s[i][s] * score_e[i][e - 1])
s_idx.append(s)
e_idx.append(e - 1)
if len(scores) == 0:
# No candidates present
pred_s.append([])
pred_e.append([])
pred_score.append([])
else:
# Rank found candidates
scores = np.array(scores)
s_idx = np.array(s_idx)
e_idx = np.array(e_idx)
idx_sort = np.argsort(-scores)[0:top_n]
pred_s.append(s_idx[idx_sort])
pred_e.append(e_idx[idx_sort])
pred_score.append(scores[idx_sort])
del score_s, score_e, candidates
return pred_s, pred_e, pred_score
# --------------------------------------------------------------------------
# Saving and loading
# --------------------------------------------------------------------------
def save(self, filename):
state_dict = copy.copy(self.network.state_dict())
if 'fixed_embedding' in state_dict:
state_dict.pop('fixed_embedding')
params = {
'state_dict': state_dict,
'word_dict': self.word_dict,
'char_dict': self.char_dict,
'feature_dict': self.feature_dict,
'args': self.args,
}
try:
torch.save(params, filename)
except BaseException:
logger.warning('WARN: Saving failed... continuing anyway.')
def checkpoint(self, filename, epoch):
params = {
'state_dict': self.network.state_dict(),
'word_dict': self.word_dict,
'char_dict': self.char_dict,
'feature_dict': self.feature_dict,
'args': self.args,
'epoch': epoch,
'optimizer': self.optimizer.state_dict(),
}
try:
torch.save(params, filename)
except BaseException:
logger.warning('WARN: Saving failed... continuing anyway.')
@staticmethod
def load(filename, new_args=None, normalize=True):
logger.info('Loading model %s' % filename)
saved_params = torch.load(
filename, map_location=lambda storage, loc: storage
)
word_dict = saved_params['word_dict']
try:
char_dict = saved_params['char_dict']
except KeyError as e:
char_dict = Dictionary()
feature_dict = saved_params['feature_dict']
state_dict = saved_params['state_dict']
args = saved_params['args']
if new_args:
args = override_model_args(args, new_args)
return DocReader(args, word_dict, char_dict, feature_dict, state_dict, normalize)
@staticmethod
def load_checkpoint(filename, normalize=True):
logger.info('Loading model %s' % filename)
saved_params = torch.load(
filename, map_location=lambda storage, loc: storage
)
word_dict = saved_params['word_dict']
char_dict = saved_params['char_dict']
feature_dict = saved_params['feature_dict']
state_dict = saved_params['state_dict']
epoch = saved_params['epoch']
optimizer = saved_params['optimizer']
args = saved_params['args']
model = DocReader(args, word_dict, char_dict, feature_dict, state_dict, normalize)
model.init_optimizer(optimizer)
return model, epoch
# --------------------------------------------------------------------------
# Runtime
# --------------------------------------------------------------------------
def cuda(self):
self.use_cuda = True
self.network = self.network.cuda()
def cpu(self):
self.use_cuda = False
self.network = self.network.cpu()
def parallelize(self):
"""Use data parallel to copy the model across several gpus.
This will take all gpus visible with CUDA_VISIBLE_DEVICES.
"""
self.parallel = True
self.network = torch.nn.DataParallel(self.network)
================================================
FILE: predictor.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Machine Comprehension predictor"""
import logging
from multiprocessing import Pool as ProcessPool
from multiprocessing.util import Finalize
from vector import vectorize, batchify
from model import DocReader
import utils
from spacy_tokenizer import SpacyTokenizer
logger = logging.getLogger(__name__)
# ------------------------------------------------------------------------------
# Tokenize + annotate
# ------------------------------------------------------------------------------
TOK = None
def init(options):
global TOK
TOK = SpacyTokenizer(**options)
Finalize(TOK, TOK.shutdown, exitpriority=100)
def tokenize(text):
global TOK
return TOK.tokenize(text)
def get_annotators_for_model(model):
annotators = set()
if model.args.use_pos:
annotators.add('pos')
if model.args.use_lemma:
annotators.add('lemma')
if model.args.use_ner:
annotators.add('ner')
return annotators
# ------------------------------------------------------------------------------
# Predictor class.
# ------------------------------------------------------------------------------
class Predictor(object):
"""Load a pretrained DocReader model and predict inputs on the fly."""
def __init__(self, model, normalize=True,
embedding_file=None, char_embedding_file=None, num_workers=None):
"""
Args:
model: path to saved model file.
normalize: squash output score to 0-1 probabilities with a softmax.
embedding_file: if provided, will expand dictionary to use all
available pretrained vectors in this file.
num_workers: number of CPU processes to use to preprocess batches.
"""
logger.info('Initializing model...')
self.model = DocReader.load(model, normalize=normalize)
if embedding_file:
logger.info('Expanding dictionary...')
utils.index_embedding_words(embedding_file)
added_words = self.model.expand_dictionary(words)
self.model.load_embeddings(added_words, embedding_file)
if char_embedding_file:
logger.info('Expanding dictionary...')
chars = utils.index_embedding_chars(char_embedding_file)
added_chars = self.model.expand_char_dictionary(chars)
self.model.load_char_embeddings(added_chars, char_embedding_file)
logger.info('Initializing tokenizer...')
annotators = get_annotators_for_model(self.model)
if num_workers is None or num_workers > 0:
self.workers = ProcessPool(
num_workers,
initializer=init,
initargs=({'annotators': annotators},),
)
else:
self.workers = None
self.tokenizer = SpacyTokenizer(annotators=annotators)
def predict(self, document, question, candidates=None, top_n=1):
"""Predict a single document - question pair."""
results = self.predict_batch([(document, question, candidates,)], top_n)
return results[0]
def predict_batch(self, batch, top_n=1):
"""Predict a batch of document - question pairs."""
documents, questions, candidates = [], [], []
for b in batch:
documents.append(b[0])
questions.append(b[1])
candidates.append(b[2] if len(b) == 3 else None)
candidates = candidates if any(candidates) else None
# Tokenize the inputs, perhaps multi-processed.
if self.workers:
q_tokens = self.workers.map_async(tokenize, questions)
c_tokens = self.workers.map_async(tokenize, documents)
q_tokens = list(q_tokens.get())
c_tokens = list(c_tokens.get())
else:
q_tokens = list(map(self.tokenizer.tokenize, questions))
c_tokens = list(map(self.tokenizer.tokenize, documents))
examples = []
for i in range(len(questions)):
examples.append({
'id': i,
'question': q_tokens[i].words(),
'question_char': q_tokens[i].chars(),
'qlemma': q_tokens[i].lemmas(),
'qpos': q_tokens[i].pos(),
'qner': q_tokens[i].entities(),
'document': c_tokens[i].words(),
'document_char': c_tokens[i].chars(),
'clemma': c_tokens[i].lemmas(),
'cpos': c_tokens[i].pos(),
'cner': c_tokens[i].entities(),
})
# Stick document tokens in candidates for decoding
if candidates:
candidates = [{'input': c_tokens[i], 'cands': candidates[i]}
for i in range(len(candidates))]
# Build the batch and run it through the model
batch_exs = batchify([vectorize(e, self.model) for e in examples])
s, e, score = self.model.predict(batch_exs, candidates, top_n)
# Retrieve the predicted spans
results = []
for i in range(len(s)):
predictions = []
for j in range(len(s[i])):
span = c_tokens[i].slice(s[i][j], e[i][j] + 1).untokenize()
predictions.append((span, score[i][j]))
results.append(predictions)
return results
def cuda(self):
self.model.cuda()
def cpu(self):
self.model.cpu()
================================================
FILE: r_net.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Implementation of the R-Net based reader."""
import torch
import torch.nn as nn
import torch.nn.functional as F
import layers
from torch.autograd import Variable
# ------------------------------------------------------------------------------
# Network
# ------------------------------------------------------------------------------
class R_Net(nn.Module):
RNN_TYPES = {'lstm': nn.LSTM, 'gru': nn.GRU, 'rnn': nn.RNN}
CELL_TYPES = {'lstm': nn.LSTMCell, 'gru': nn.GRUCell, 'rnn': nn.RNNCell}
def __init__(self, args, normalize=True):
super(R_Net, self).__init__()
# Store config
self.args = args
# Word embeddings (+1 for padding)
self.embedding = nn.Embedding(args.vocab_size,
args.embedding_dim,
padding_idx=0)
# Char embeddings (+1 for padding)
self.char_embedding = nn.Embedding(args.char_size,
args.char_embedding_dim,
padding_idx=0)
# Char rnn to generate char features
self.char_rnn = layers.StackedBRNN(
input_size=args.char_embedding_dim,
hidden_size=args.char_hidden_size,
num_layers=1,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=False,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=False,
)
doc_input_size = args.embedding_dim + args.char_hidden_size * 2
# Encoder
self.encode_rnn = layers.StackedBRNN(
input_size=doc_input_size,
hidden_size=args.hidden_size,
num_layers=args.doc_layers,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=args.concat_rnn_layers,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=args.rnn_padding,
)
# Output sizes of rnn encoder
doc_hidden_size = 2 * args.hidden_size
question_hidden_size = 2 * args.hidden_size
if args.concat_rnn_layers:
doc_hidden_size *= args.doc_layers
question_hidden_size *= args.question_layers
# Gated-attention-based RNN of the whole question
self.question_attn = layers.SeqAttnMatch(question_hidden_size, identity=False)
self.question_attn_gate = layers.Gate(doc_hidden_size + question_hidden_size)
self.question_attn_rnn = layers.StackedBRNN(
input_size=doc_hidden_size + question_hidden_size,
hidden_size=args.hidden_size,
num_layers=1,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=False,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=args.rnn_padding,
)
question_attn_hidden_size = 2 * args.hidden_size
# Self-matching-attention-baed RNN of the whole doc
self.doc_self_attn = layers.SelfAttnMatch(question_attn_hidden_size, identity=False)
self.doc_self_attn_gate = layers.Gate(question_attn_hidden_size + question_attn_hidden_size)
self.doc_self_attn_rnn = layers.StackedBRNN(
input_size=question_attn_hidden_size + question_attn_hidden_size,
hidden_size=args.hidden_size,
num_layers=1,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=False,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=args.rnn_padding,
)
doc_self_attn_hidden_size = 2 * args.hidden_size
self.doc_self_attn_rnn2 = layers.StackedBRNN(
input_size=doc_self_attn_hidden_size,
hidden_size=args.hidden_size,
num_layers=1,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=False,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=args.rnn_padding,
)
self.ptr_net = layers.PointerNetwork(
x_size = doc_self_attn_hidden_size,
y_size = question_hidden_size,
hidden_size = args.hidden_size,
dropout_rate=args.dropout_rnn,
cell_type=nn.GRUCell,
normalize=normalize
)
def forward(self, x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask):
"""Inputs:
x1 = document word indices [batch * len_d]
x1_c = document char indices [batch * len_d]
x1_f = document word features indices [batch * len_d * nfeat]
x1_mask = document padding mask [batch * len_d]
x2 = question word indices [batch * len_q]
x2_c = document char indices [batch * len_d]
x1_f = document word features indices [batch * len_d * nfeat]
x2_mask = question padding mask [batch * len_q]
"""
# Embed both document and question
x1_emb = self.embedding(x1)
x2_emb = self.embedding(x2)
x1_c_emb = self.char_embedding(x1_c)
x2_c_emb = self.char_embedding(x2_c)
# Dropout on embeddings
if self.args.dropout_emb > 0:
x1_emb = F.dropout(x1_emb, p=self.args.dropout_emb, training=self.training)
x2_emb = F.dropout(x2_emb, p=self.args.dropout_emb, training=self.training)
x1_c_emb = F.dropout(x1_c_emb, p=self.args.dropout_emb, training=self.training)
x2_c_emb = F.dropout(x2_c_emb, p=self.args.dropout_emb, training=self.training)
# Generate char features
x1_c_features = self.char_rnn(
x1_c_emb.reshape((x1_c_emb.size(0) * x1_c_emb.size(1), x1_c_emb.size(2), x1_c_emb.size(3))),
x1_mask.unsqueeze(2).repeat(1, 1, x1_c_emb.size(2)).reshape((x1_c_emb.size(0) * x1_c_emb.size(1), x1_c_emb.size(2)))
).reshape((x1_c_emb.size(0), x1_c_emb.size(1), x1_c_emb.size(2), -1))[:,:,-1,:]
x2_c_features = self.char_rnn(
x2_c_emb.reshape((x2_c_emb.size(0) * x2_c_emb.size(1), x2_c_emb.size(2), x2_c_emb.size(3))),
x2_mask.unsqueeze(2).repeat(1, 1, x2_c_emb.size(2)).reshape((x2_c_emb.size(0) * x2_c_emb.size(1), x2_c_emb.size(2)))
).reshape((x2_c_emb.size(0), x2_c_emb.size(1), x2_c_emb.size(2), -1))[:,:,-1,:]
# Combine input
crnn_input = [x1_emb, x1_c_features]
qrnn_input = [x2_emb, x2_c_features]
# Encode document with RNN
c = self.encode_rnn(torch.cat(crnn_input, 2), x1_mask)
# Encode question with RNN
q = self.encode_rnn(torch.cat(qrnn_input, 2), x2_mask)
# Match questions to docs
question_attn_hiddens = self.question_attn(c, q, x2_mask)
rnn_input = self.question_attn_gate(torch.cat([c, question_attn_hiddens], 2))
c = self.question_attn_rnn(rnn_input, x1_mask)
# Match documents to themselves
doc_self_attn_hiddens = self.doc_self_attn(c, x1_mask)
rnn_input = self.doc_self_attn_gate(torch.cat([c, doc_self_attn_hiddens], 2))
c = self.doc_self_attn_rnn(rnn_input, x1_mask)
c = self.doc_self_attn_rnn2(c, x1_mask)
# Predict
start_scores, end_scores = self.ptr_net(c, q, x1_mask, x2_mask)
return start_scores, end_scores
================================================
FILE: rnn_reader.py
================================================
#!/usr/bin/env python3
# Copyright 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Implementation of the RNN based DrQA reader."""
import torch
import torch.nn as nn
import layers
# ------------------------------------------------------------------------------
# Network
# ------------------------------------------------------------------------------
class RnnDocReader(nn.Module):
RNN_TYPES = {'lstm': nn.LSTM, 'gru': nn.GRU, 'rnn': nn.RNN}
CELL_TYPES = {'lstm': nn.LSTMCell, 'gru': nn.GRUCell, 'rnn': nn.RNNCell}
def __init__(self, args, normalize=True):
super(RnnDocReader, self).__init__()
# Store config
self.args = args
# Word embeddings (+1 for padding)
self.embedding = nn.Embedding(args.vocab_size,
args.embedding_dim,
padding_idx=0)
# Projection for attention weighted question
if args.use_qemb:
self.qemb_match = layers.SeqAttnMatch(args.embedding_dim)
# Input size to RNN: word emb + question emb + manual features
doc_input_size = args.embedding_dim + args.num_features
if args.use_qemb:
doc_input_size += args.embedding_dim
# RNN document encoder
self.doc_rnn = layers.StackedBRNN(
input_size=doc_input_size,
hidden_size=args.hidden_size,
num_layers=args.doc_layers,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=args.concat_rnn_layers,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=args.rnn_padding,
)
# RNN question encoder
self.question_rnn = layers.StackedBRNN(
input_size=args.embedding_dim,
hidden_size=args.hidden_size,
num_layers=args.question_layers,
dropout_rate=args.dropout_rnn,
dropout_output=args.dropout_rnn_output,
concat_layers=args.concat_rnn_layers,
rnn_type=self.RNN_TYPES[args.rnn_type],
padding=args.rnn_padding,
)
# Output sizes of rnn encoders
doc_hidden_size = 2 * args.hidden_size
question_hidden_size = 2 * args.hidden_size
if args.concat_rnn_layers:
doc_hidden_size *= args.doc_layers
question_hidden_size *= args.question_layers
# Question merging
if args.question_merge not in ['avg', 'self_attn']:
raise NotImplementedError('merge_mode = %s' % args.merge_mode)
if args.question_merge == 'self_attn':
self.self_attn = layers.LinearSeqAttn(question_hidden_size)
# Bilinear attention for span start/end
self.start_attn = layers.BilinearSeqAttn(
doc_hidden_size,
question_hidden_size,
normalize=normalize,
)
self.end_attn = layers.BilinearSeqAttn(
doc_hidden_size,
question_hidden_size,
normalize=normalize,
)
def forward(self, x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask):
"""Inputs:
x1 = document word indices [batch * len_d]
x1_f = document word features indices [batch * len_d * nfeat]
x1_mask = document padding mask [batch * len_d]
x2 = question word indices [batch * len_q]
x2_mask = question padding mask [batch * len_q]
"""
# Embed both document and question
x1_emb = self.embedding(x1)
x2_emb = self.embedding(x2)
# Dropout on embeddings
if self.args.dropout_emb > 0:
x1_emb = nn.functional.dropout(x1_emb, p=self.args.dropout_emb,
training=self.training)
x2_emb = nn.functional.dropout(x2_emb, p=self.args.dropout_emb,
training=self.training)
# Form document encoding inputs
drnn_input = [x1_emb]
# Add attention-weighted question representation
if self.args.use_qemb:
x2_weighted_emb = self.qemb_match(x1_emb, x2_emb, x2_mask)
drnn_input.append(x2_weighted_emb)
# Add manual features
if self.args.num_features > 0:
drnn_input.append(x1_f)
# Encode document with RNN
doc_hiddens = self.doc_rnn(torch.cat(drnn_input, 2), x1_mask)
# Encode question with RNN + merge hiddens
question_hiddens = self.question_rnn(x2_emb, x2_mask)
if self.args.question_merge == 'avg':
q_merge_weights = layers.uniform_weights(question_hiddens, x2_mask)
elif self.args.question_merge == 'self_attn':
q_merge_weights = self.self_attn(question_hiddens, x2_mask)
question_hidden = layers.weighted_avg(question_hiddens, q_merge_weights)
# Predict start and end positions
start_scores = self.start_attn(doc_hiddens, question_hidden, x1_mask)
end_scores = self.end_attn(doc_hiddens, question_hidden, x1_mask)
return start_scores, end_scores
================================================
FILE: script/evaluate-v1.1.py
================================================
""" Official evaluation script for v1.1 of the SQuAD dataset. """
from __future__ import print_function
from collections import Counter
import string
import re
import argparse
import json
import sys
def normalize_answer(s):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(text):
return re.sub(r'\b(a|an|the)\b', ' ', text)
def white_space_fix(text):
return ' '.join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return ''.join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
def f1_score(prediction, ground_truth):
prediction_tokens = normalize_answer(prediction).split()
ground_truth_tokens = normalize_answer(ground_truth).split()
common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
num_same = sum(common.values())
if num_same == 0:
return 0
precision = 1.0 * num_same / len(prediction_tokens)
recall = 1.0 * num_same / len(ground_truth_tokens)
f1 = (2 * precision * recall) / (precision + recall)
return f1
def exact_match_score(prediction, ground_truth):
return (normalize_answer(prediction) == normalize_answer(ground_truth))
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
scores_for_ground_truths = []
for ground_truth in ground_truths:
score = metric_fn(prediction, ground_truth)
scores_for_ground_truths.append(score)
return max(scores_for_ground_truths)
def evaluate(dataset, predictions):
f1 = exact_match = total = 0
for article in dataset:
for paragraph in article['paragraphs']:
for qa in paragraph['qas']:
total += 1
if qa['id'] not in predictions:
message = 'Unanswered question ' + qa['id'] + \
' will receive score 0.'
print(message, file=sys.stderr)
continue
ground_truths = list(map(lambda x: x['text'], qa['answers']))
prediction = predictions[qa['id']]
exact_match += metric_max_over_ground_truths(
exact_match_score, prediction, ground_truths)
f1 += metric_max_over_ground_truths(
f1_score, prediction, ground_truths)
exact_match = 100.0 * exact_match / total
f1 = 100.0 * f1 / total
return {'exact_match': exact_match, 'f1': f1}
if __name__ == '__main__':
expected_version = '1.1'
parser = argparse.ArgumentParser(
description='Evaluation for SQuAD ' + expected_version)
parser.add_argument('dataset_file', help='Dataset file')
parser.add_argument('prediction_file', help='Prediction File')
args = parser.parse_args()
with open(args.dataset_file) as dataset_file:
dataset_json = json.load(dataset_file)
if (dataset_json['version'] != expected_version):
print('Evaluation expects v-' + expected_version +
', but got dataset with v-' + dataset_json['version'],
file=sys.stderr)
dataset = dataset_json['data']
with open(args.prediction_file) as prediction_file:
predictions = json.load(prediction_file)
print(json.dumps(evaluate(dataset, predictions)))
================================================
FILE: script/interactive.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""A script to run the reader model interactively."""
import sys
sys.path.append('.')
import torch
import code
import argparse
import logging
import prettytable
import time
from predictor import Predictor
from multiprocessing import cpu_count
logger = logging.getLogger()
logger.setLevel(logging.INFO)
fmt = logging.Formatter('%(asctime)s: [ %(message)s ]', '%m/%d/%Y %I:%M:%S %p')
console = logging.StreamHandler()
console.setFormatter(fmt)
logger.addHandler(console)
PREDICTOR = None
# ------------------------------------------------------------------------------
# Drop in to interactive mode
# ------------------------------------------------------------------------------
def process(document, question, candidates=None, top_n=1):
t0 = time.time()
predictions = PREDICTOR.predict(document, question, candidates, top_n)
table = prettytable.PrettyTable(['Rank', 'Span', 'Score'])
for i, p in enumerate(predictions, 1):
table.add_row([i, p[0], p[1]])
print(table)
print('Time: %.4f' % (time.time() - t0))
banner = """
* WRMCQA interactive Document Reader Module *
* Repo: Mnemonic Reader (https://github.com/HKUST-KnowComp/MnemonicReader)
* Implement based on Facebook's DrQA
>>> process(document, question, candidates=None, top_n=1)
>>> usage()
"""
def usage():
print(banner)
# ------------------------------------------------------------------------------
# Commandline arguments & init
# ------------------------------------------------------------------------------
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default=None,
help='Path to model to use')
parser.add_argument('--embedding-file', type=str, default=None,
help=('Expand dictionary to use all pretrained '
'embeddings in this file.'))
parser.add_argument('--char-embedding-file', type=str, default=None,
help=('Expand dictionary to use all pretrained '
'char embeddings in this file.'))
parser.add_argument('--num-workers', type=int, default=int(cpu_count()/2),
help='Number of CPU processes (for tokenizing, etc)')
parser.add_argument('--no-cuda', action='store_true',
help='Use CPU only')
parser.add_argument('--gpu', type=int, default=-1,
help='Specify GPU device id to use')
parser.add_argument('--no-normalize', action='store_true',
help='Do not softmax normalize output scores.')
args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()
if args.cuda:
torch.cuda.set_device(args.gpu)
logger.info('CUDA enabled (GPU %d)' % args.gpu)
else:
logger.info('Running on CPU only.')
PREDICTOR = Predictor(
args.model,
normalize=not args.no_normalize,
embedding_file=args.embedding_file,
char_embedding_file=args.char_embedding_file,
num_workers=args.num_workers,
)
if args.cuda:
PREDICTOR.cuda()
code.interact(banner=banner, local=locals())
================================================
FILE: script/predict.py
================================================
#!/usr/bin/env python3
# Copyright 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""A script to make and save model predictions on an input dataset."""
import sys
sys.path.append('.')
import os
import time
import torch
import argparse
import logging
try:
import ujson as json
except ImportError:
import json
from tqdm import tqdm
from predictor import Predictor
from multiprocessing import cpu_count
logger = logging.getLogger()
logger.setLevel(logging.INFO)
fmt = logging.Formatter('%(asctime)s: [ %(message)s ]', '%m/%d/%Y %I:%M:%S %p')
console = logging.StreamHandler()
console.setFormatter(fmt)
logger.addHandler(console)
parser = argparse.ArgumentParser()
parser.add_argument('dataset', type=str, default=None,
help='SQuAD-like dataset to evaluate on')
parser.add_argument('--model', type=str, default=None,
help='Path to model to use')
parser.add_argument('--embedding-file', type=str, default=None,
help=('Expand dictionary to use all pretrained '
'embeddings in this file.'))
parser.add_argument('--char-embedding-file', type=str, default=None,
help=('Expand dictionary to use all pretrained '
'char embeddings in this file.'))
parser.add_argument('--out-dir', type=str, default='data/predict',
help=('Directory to write prediction file to '
'(<dataset>-<model>.preds)'))
parser.add_argument('--num-workers', type=int, default=int(cpu_count()/2),
help='Number of CPU processes (for tokenizing, etc)')
parser.add_argument('--no-cuda', action='store_true',
help='Use CPU only')
parser.add_argument('--gpu', type=int, default=-1,
help='Specify GPU device id to use')
parser.add_argument('--batch-size', type=int, default=128,
help='Example batching size')
parser.add_argument('--top-n', type=int, default=1,
help='Store top N predicted spans per example')
parser.add_argument('--official', type=bool, default=True,
help='Only store single top span instead of top N list')
args = parser.parse_args()
t0 = time.time()
args.cuda = not args.no_cuda and torch.cuda.is_available()
if args.cuda:
torch.cuda.set_device(args.gpu)
logger.info('CUDA enabled (GPU %d)' % args.gpu)
else:
logger.info('Running on CPU only.')
predictor = Predictor(
args.model,
normalize=True,
embedding_file=args.embedding_file,
char_embedding_file=args.char_embedding_file,
num_workers=args.num_workers,
)
if args.cuda:
predictor.cuda()
# ------------------------------------------------------------------------------
# Read in dataset and make predictions.
# ------------------------------------------------------------------------------
examples = []
qids = []
with open(args.dataset) as f:
data = json.load(f)['data']
for article in data:
for paragraph in article['paragraphs']:
context = paragraph['context']
for qa in paragraph['qas']:
qids.append(qa['id'])
examples.append((context, qa['question']))
results = {}
for i in tqdm(range(0, len(examples), args.batch_size)):
predictions = predictor.predict_batch(
examples[i:i + args.batch_size], top_n=args.top_n
)
for j in range(len(predictions)):
# Official eval expects just a qid --> span
if args.official:
results[qids[i + j]] = predictions[j][0][0]
# Otherwise we store top N and scores for debugging.
else:
results[qids[i + j]] = [(p[0], float(p[1])) for p in predictions[j]]
model = os.path.splitext(os.path.basename(args.model or 'default'))[0]
basename = os.path.splitext(os.path.basename(args.dataset))[0]
outfile = os.path.join(args.out_dir, basename + '-' + model + '.preds')
if not os.path.isdir(args.out_dir):
os.mkdir(args.out_dir)
logger.info('Writing results to %s' % outfile)
with open(outfile, 'w') as f:
json.dump(results, f)
logger.info('Total time: %.2f' % (time.time() - t0))
================================================
FILE: script/preprocess.py
================================================
#!/usr/bin/env python3
# Copyright 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Preprocess the SQuAD dataset for training."""
import sys
sys.path.append('.')
import argparse
import os
try:
import ujson as json
except ImportError:
import json
import time
from multiprocessing import Pool, cpu_count
from multiprocessing.util import Finalize
from functools import partial
from spacy_tokenizer import SpacyTokenizer
# ------------------------------------------------------------------------------
# Tokenize + annotate.
# ------------------------------------------------------------------------------
TOK = None
ANNTOTORS = {'lemma', 'pos', 'ner'}
def init():
global TOK
TOK = SpacyTokenizer(annotators=ANNTOTORS)
Finalize(TOK, TOK.shutdown, exitpriority=100)
def tokenize(text):
"""Call the global process tokenizer on the input text."""
global TOK
tokens = TOK.tokenize(text)
output = {
'words': tokens.words(),
'chars': tokens.chars(),
'offsets': tokens.offsets(),
'pos': tokens.pos(),
'lemma': tokens.lemmas(),
'ner': tokens.entities(),
}
return output
# ------------------------------------------------------------------------------
# Process dataset examples
# ------------------------------------------------------------------------------
def load_dataset(path):
"""Load json file and store fields separately."""
with open(path) as f:
data = json.load(f)['data']
output = {'qids': [], 'questions': [], 'answers': [],
'contexts': [], 'qid2cid': []}
for article in data:
for paragraph in article['paragraphs']:
output['contexts'].append(paragraph['context'])
for qa in paragraph['qas']:
output['qids'].append(qa['id'])
output['questions'].append(qa['question'])
output['qid2cid'].append(len(output['contexts']) - 1)
if 'answers' in qa:
output['answers'].append(qa['answers'])
return output
def find_answer(offsets, begin_offset, end_offset):
"""Match token offsets with the char begin/end offsets of the answer."""
start = [i for i, tok in enumerate(offsets) if tok[0] == begin_offset]
end = [i for i, tok in enumerate(offsets) if tok[1] == end_offset]
assert(len(start) <= 1)
assert(len(end) <= 1)
if len(start) == 1 and len(end) == 1:
return start[0], end[0]
def process_dataset(data, tokenizer, workers=None):
"""Iterate processing (tokenize, parse, etc) dataset multithreaded."""
make_pool = partial(Pool, workers, initializer=init)
workers = make_pool(initargs=())
q_tokens = workers.map(tokenize, data['questions'])
workers.close()
workers.join()
workers = make_pool(initargs=())
c_tokens = workers.map(tokenize, data['contexts'])
workers.close()
workers.join()
for idx in range(len(data['qids'])):
question = q_tokens[idx]['words']
question_char = q_tokens[idx]['chars']
qlemma = q_tokens[idx]['lemma']
qpos = q_tokens[idx]['pos']
qner = q_tokens[idx]['ner']
document = c_tokens[data['qid2cid'][idx]]['words']
document_char = c_tokens[data['qid2cid'][idx]]['chars']
offsets = c_tokens[data['qid2cid'][idx]]['offsets']
clemma = c_tokens[data['qid2cid'][idx]]['lemma']
cpos = c_tokens[data['qid2cid'][idx]]['pos']
cner = c_tokens[data['qid2cid'][idx]]['ner']
ans_tokens = []
if len(data['answers']) > 0:
for ans in data['answers'][idx]:
found = find_answer(offsets,
ans['answer_start'],
ans['answer_start'] + len(ans['text']))
if found:
ans_tokens.append(found)
yield {
'id': data['qids'][idx],
'question': question,
'question_char': question_char,
'document': document,
'document_char': document_char,
'offsets': offsets,
'answers': ans_tokens,
'qlemma': qlemma,
'qpos': qpos,
'qner': qner,
'clemma': clemma,
'cpos': cpos,
'cner': cner,
}
# -----------------------------------------------------------------------------
# Commandline options
# -----------------------------------------------------------------------------
parser = argparse.ArgumentParser()
parser.add_argument('data_dir', type=str, help='Path to SQuAD data directory')
parser.add_argument('out_dir', type=str, help='Path to output file dir')
parser.add_argument('--split', type=str, help='Filename for train/dev split')
parser.add_argument('--num-workers', type=int, default=1)
parser.add_argument('--tokenizer', type=str, default='spacy')
args = parser.parse_args()
t0 = time.time()
in_file = os.path.join(args.data_dir, args.split + '.json')
print('Loading dataset %s' % in_file, file=sys.stderr)
dataset = load_dataset(in_file)
out_file = os.path.join(
args.out_dir, '%s-processed-%s.txt' % (args.split, args.tokenizer)
)
print('Will write to file %s' % out_file, file=sys.stderr)
with open(out_file, 'w') as f:
for ex in process_dataset(dataset, args.tokenizer, args.num_workers):
f.write(json.dumps(ex) + '\n')
print('Total time: %.4f (s)' % (time.time() - t0))
================================================
FILE: script/train.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Main reader training script."""
import sys
sys.path.append('.')
import argparse
import torch
import numpy as np
try:
import ujson as json
except ImportError:
import json
import os
import subprocess
import logging
import utils, vector, config, data
from model import DocReader
logger = logging.getLogger()
# ------------------------------------------------------------------------------
# Training arguments.
# ------------------------------------------------------------------------------
# Defaults
DATA_DIR = os.path.join('data', 'datasets')
MODEL_DIR = os.path.join('data', 'models')
EMBED_DIR = os.path.join('data', 'embeddings')
def str2bool(v):
return v.lower() in ('yes', 'true', 't', '1', 'y')
def add_train_args(parser):
"""Adds commandline arguments pertaining to training a model. These
are different from the arguments dictating the model architecture.
"""
parser.register('type', 'bool', str2bool)
# Runtime environment
runtime = parser.add_argument_group('Environment')
runtime.add_argument('--no-cuda', type='bool', default=False,
help='Train on CPU, even if GPUs are available.')
runtime.add_argument('--gpu', type=int, default=-1,
help='Run on a specific GPU')
runtime.add_argument('--data-workers', type=int, default=5,
help='Number of subprocesses for data loading')
runtime.add_argument('--parallel', type='bool', default=False,
help='Use DataParallel on all available GPUs')
runtime.add_argument('--random-seed', type=int, default=1013,
help=('Random seed for all numpy/torch/cuda '
'operations (for reproducibility)'))
runtime.add_argument('--num-epochs', type=int, default=40,
help='Train data iterations')
runtime.add_argument('--batch-size', type=int, default=45,
help='Batch size for training')
runtime.add_argument('--test-batch-size', type=int, default=32,
help='Batch size during validation/testing')
# Files
files = parser.add_argument_group('Filesystem')
files.add_argument('--model-dir', type=str, default=MODEL_DIR,
help='Directory for saved models/checkpoints/logs')
files.add_argument('--model-name', type=str, default='',
help='Unique model identifier (.mdl, .txt, .checkpoint)')
files.add_argument('--data-dir', type=str, default=DATA_DIR,
help='Directory of training/validation data')
files.add_argument('--train-file', type=str,
default='SQuAD-v1.1-train-processed-spacy.txt',
help='Preprocessed train file')
files.add_argument('--dev-file', type=str,
default='SQuAD-v1.1-dev-processed-spacy.txt',
help='Preprocessed dev file')
files.add_argument('--dev-json', type=str, default='SQuAD-v1.1-dev.json',
help=('Unprocessed dev file to run validation '
'while training on'))
files.add_argument('--embed-dir', type=str, default=EMBED_DIR,
help='Directory of pre-trained embedding files')
files.add_argument('--embedding-file', type=str,
default='glove.840B.300d.txt',
help='Space-separated pretrained embeddings file')
files.add_argument('--char-embedding-file', type=str,
default='glove.840B.300d-char.txt',
help='Space-separated pretrained embeddings file')
# Saving + loading
save_load = parser.add_argument_group('Saving/Loading')
save_load.add_argument('--checkpoint', type='bool', default=False,
help='Save model + optimizer state after each epoch')
save_load.add_argument('--pretrained', type=str, default='',
help='Path to a pretrained model to warm-start with')
save_load.add_argument('--expand-dictionary', type='bool', default=False,
help='Expand dictionary of pretrained model to ' +
'include training/dev words of new data')
# Data preprocessing
preprocess = parser.add_argument_group('Preprocessing')
preprocess.add_argument('--uncased-question', type='bool', default=False,
help='Question words will be lower-cased')
preprocess.add_argument('--uncased-doc', type='bool', default=False,
help='Document words will be lower-cased')
preprocess.add_argument('--restrict-vocab', type='bool', default=True,
help='Only use pre-trained words in embedding_file')
# General
general = parser.add_argument_group('General')
general.add_argument('--official-eval', type='bool', default=True,
help='Validate with official SQuAD eval')
general.add_argument('--valid-metric', type=str, default='exact_match',
help='The evaluation metric used for model selection: None, exact_match, f1')
general.add_argument('--display-iter', type=int, default=25,
help='Log state after every <display_iter> epochs')
general.add_argument('--sort-by-len', type='bool', default=True,
help='Sort batches by length for speed')
def set_defaults(args):
"""Make sure the commandline arguments are initialized properly."""
# Check critical files exist
args.dev_json = os.path.join(args.data_dir, args.dev_json)
if not os.path.isfile(args.dev_json):
raise IOError('No such file: %s' % args.dev_json)
args.train_file = os.path.join(args.data_dir, args.train_file)
if not os.path.isfile(args.train_file):
raise IOError('No such file: %s' % args.train_file)
args.dev_file = os.path.join(args.data_dir, args.dev_file)
if not os.path.isfile(args.dev_file):
raise IOError('No such file: %s' % args.dev_file)
if args.embedding_file:
args.embedding_file = os.path.join(args.embed_dir, args.embedding_file)
if not os.path.isfile(args.embedding_file):
raise IOError('No such file: %s' % args.embedding_file)
if args.char_embedding_file:
args.char_embedding_file = os.path.join(args.embed_dir, args.char_embedding_file)
if not os.path.isfile(args.char_embedding_file):
raise IOError('No such file: %s' % args.char_embedding_file)
# Set model directory
subprocess.call(['mkdir', '-p', args.model_dir])
# Set model name
if not args.model_name:
import uuid
import time
args.model_name = time.strftime("%Y%m%d-") + str(uuid.uuid4())[:8]
# Set log + model file names
args.log_file = os.path.join(args.model_dir, args.model_name + '.txt')
args.model_file = os.path.join(args.model_dir, args.model_name + '.mdl')
# Embeddings options
if args.embedding_file:
with open(args.embedding_file) as f:
dim = len(f.readline().strip().split(' ')) - 1
args.embedding_dim = dim
elif not args.embedding_dim:
raise RuntimeError('Either embedding_file or embedding_dim '
'needs to be specified.')
if args.char_embedding_file:
with open(args.char_embedding_file) as f:
dim = len(f.readline().strip().split(' ')) - 1
args.char_embedding_dim = dim
elif not args.char_embedding_dim:
raise RuntimeError('Either char_embedding_file or char_embedding_dim '
'needs to be specified.')
# Make sure tune_partial and fix_embeddings are consistent.
if args.tune_partial > 0 and args.fix_embeddings:
logger.warning('WARN: fix_embeddings set to False as tune_partial > 0.')
args.fix_embeddings = False
# Make sure fix_embeddings and embedding_file are consistent
if args.fix_embeddings:
if not (args.embedding_file or args.pretrained):
logger.warning('WARN: fix_embeddings set to False '
'as embeddings are random.')
args.fix_embeddings = False
return args
# ------------------------------------------------------------------------------
# Initalization from scratch.
# ------------------------------------------------------------------------------
def init_from_scratch(args, train_exs, dev_exs):
"""New model, new data, new dictionary."""
# Create a feature dict out of the annotations in the data
logger.info('-' * 100)
logger.info('Generate features')
feature_dict = utils.build_feature_dict(args, train_exs)
logger.info('Num features = %d' % len(feature_dict))
logger.info(feature_dict)
# Build a dictionary from the data questions + documents (train/dev splits)
logger.info('-' * 100)
logger.info('Build word dictionary')
word_dict = utils.build_word_dict(args, train_exs + dev_exs)
logger.info('Num words = %d' % len(word_dict))
# Build a char dictionary from the data questions + documents (train/dev splits)
logger.info('-' * 100)
logger.info('Build char dictionary')
char_dict = utils.build_char_dict(args, train_exs + dev_exs)
logger.info('Num chars = %d' % len(char_dict))
# Initialize model
model = DocReader(config.get_model_args(args), word_dict, char_dict, feature_dict)
# Load pretrained embeddings for words in dictionary
if args.embedding_file:
model.load_embeddings(word_dict.tokens(), args.embedding_file)
if args.char_embedding_file:
model.load_char_embeddings(char_dict.tokens(), args.char_embedding_file)
return model
# ------------------------------------------------------------------------------
# Train loop.
# ------------------------------------------------------------------------------
def train(args, data_loader, model, global_stats):
"""Run through one epoch of model training with the provided data loader."""
# Initialize meters + timers
train_loss = utils.AverageMeter()
epoch_time = utils.Timer()
# Run one epoch
for idx, ex in enumerate(data_loader):
train_loss.update(*model.update(ex))
if idx % args.display_iter == 0:
logger.info('train: Epoch = %d | iter = %d/%d | ' %
(global_stats['epoch'], idx, len(data_loader)) +
'loss = %.2f | elapsed time = %.2f (s)' %
(train_loss.avg, global_stats['timer'].time()))
train_loss.reset()
logger.info('train: Epoch %d done. Time for epoch = %.2f (s)' %
(global_stats['epoch'], epoch_time.time()))
# Checkpoint
if args.checkpoint:
model.checkpoint(args.model_file + '.checkpoint',
global_stats['epoch'] + 1)
# ------------------------------------------------------------------------------
# Validation loops. Includes both "unofficial" and "official" functions that
# use different metrics and implementations.
# ------------------------------------------------------------------------------
def validate_unofficial(args, data_loader, model, global_stats, mode):
"""Run one full unofficial validation.
Unofficial = doesn't use SQuAD script.
"""
eval_time = utils.Timer()
start_acc = utils.AverageMeter()
end_acc = utils.AverageMeter()
exact_match = utils.AverageMeter()
# Make predictions
examples = 0
for ex in data_loader:
batch_size = ex[0].size(0)
pred_s, pred_e, _ = model.predict(ex)
target_s, target_e = ex[-3:-1]
# We get metrics for independent start/end and joint start/end
accuracies = eval_accuracies(pred_s, target_s, pred_e, target_e)
start_acc.update(accuracies[0], batch_size)
end_acc.update(accuracies[1], batch_size)
exact_match.update(accuracies[2], batch_size)
# If getting train accuracies, sample max 10k
examples += batch_size
if mode == 'train' and examples >= 1e4:
break
logger.info('%s valid unofficial: Epoch = %d | start = %.2f | ' %
(mode, global_stats['epoch'], start_acc.avg) +
'end = %.2f | exact = %.2f | examples = %d | ' %
(end_acc.avg, exact_match.avg, examples) +
'valid time = %.2f (s)' % eval_time.time())
return {'exact_match': exact_match.avg}
def validate_official(args, data_loader, model, global_stats,
offsets, texts, answers):
"""Run one full official validation. Uses exact spans and same
exact match/F1 score computation as in the SQuAD script.
Extra arguments:
offsets: The character start/end indices for the tokens in each context.
texts: Map of qid --> raw text of examples context (matches offsets).
answers: Map of qid --> list of accepted answers.
"""
eval_time = utils.Timer()
f1 = utils.AverageMeter()
exact_match = utils.AverageMeter()
# Run through examples
examples = 0
for ex in data_loader:
ex_id, batch_size = ex[-1], ex[0].size(0)
pred_s, pred_e, _ = model.predict(ex)
for i in range(batch_size):
s_offset = offsets[ex_id[i]][pred_s[i][0]][0]
e_offset = offsets[ex_id[i]][pred_e[i][0]][1]
prediction = texts[ex_id[i]][s_offset:e_offset]
# Compute metrics
ground_truths = answers[ex_id[i]]
exact_match.update(utils.metric_max_over_ground_truths(
utils.exact_match_score, prediction, ground_truths))
f1.update(utils.metric_max_over_ground_truths(
utils.f1_score, prediction, ground_truths))
examples += batch_size
logger.info('dev valid official: Epoch = %d | EM = %.2f | ' %
(global_stats['epoch'], exact_match.avg * 100) +
'F1 = %.2f | examples = %d | valid time = %.2f (s)' %
(f1.avg * 100, examples, eval_time.time()))
return {'exact_match': exact_match.avg * 100, 'f1': f1.avg * 100}
def eval_accuracies(pred_s, target_s, pred_e, target_e):
"""An unofficial evalutation helper.
Compute exact start/end/complete match accuracies for a batch.
"""
# Convert 1D tensors to lists of lists (compatibility)
if torch.is_tensor(target_s):
target_s = [[e] for e in target_s]
target_e = [[e] for e in target_e]
# Compute accuracies from targets
batch_size = len(pred_s)
start = utils.AverageMeter()
end = utils.AverageMeter()
em = utils.AverageMeter()
for i in range(batch_size):
# Start matches
if pred_s[i] in target_s[i]:
start.update(1)
else:
start.update(0)
# End matches
if pred_e[i] in target_e[i]:
end.update(1)
else:
end.update(0)
# Both start and end match
if any([1 for _s, _e in zip(target_s[i], target_e[i])
if _s == torch.from_numpy(pred_s[i]) and _e == torch.from_numpy(pred_e[i])]):
em.update(1)
else:
em.update(0)
return start.avg * 100, end.avg * 100, em.avg * 100
# ------------------------------------------------------------------------------
# Main.
# ------------------------------------------------------------------------------
def main(args):
# --------------------------------------------------------------------------
# DATA
logger.info('-' * 100)
logger.info('Load data files')
train_exs = utils.load_data(args, args.train_file, skip_no_answer=True)
logger.info('Num train examples = %d' % len(train_exs))
dev_exs = utils.load_data(args, args.dev_file)
logger.info('Num dev examples = %d' % len(dev_exs))
# If we are doing offician evals then we need to:
# 1) Load the original text to retrieve spans from offsets.
# 2) Load the (multiple) text answers for each question.
if args.official_eval:
dev_texts = utils.load_text(args.dev_json)
dev_offsets = {ex['id']: ex['offsets'] for ex in dev_exs}
dev_answers = utils.load_answers(args.dev_json)
# --------------------------------------------------------------------------
# MODEL
logger.info('-' * 100)
start_epoch = 0
if args.checkpoint and os.path.isfile(args.model_file + '.checkpoint'):
# Just resume training, no modifications.
logger.info('Found a checkpoint...')
checkpoint_file = args.model_file + '.checkpoint'
model, start_epoch = DocReader.load_checkpoint(checkpoint_file, args)
else:
# Training starts fresh. But the model state is either pretrained or
# newly (randomly) initialized.
if args.pretrained:
logger.info('Using pretrained model...')
model = DocReader.load(args.pretrained, args)
if args.expand_dictionary:
logger.info('Expanding dictionary for new data...')
# Add words in training + dev examples
words = utils.load_words(args, train_exs + dev_exs)
added_words = model.expand_dictionary(words)
# Load pretrained embeddings for added words
if args.embedding_file:
model.load_embeddings(added_words, args.embedding_file)
logger.info('Expanding char dictionary for new data...')
# Add words in training + dev examples
chars = utils.load_chars(args, train_exs + dev_exs)
added_chars = model.expand_char_dictionary(chars)
# Load pretrained embeddings for added words
if args.char_embedding_file:
model.load_char_embeddings(added_chars, args.char_embedding_file)
else:
logger.info('Training model from scratch...')
model = init_from_scratch(args, train_exs, dev_exs)
# Set up partial tuning of embeddings
if args.tune_partial > 0:
logger.info('-' * 100)
logger.info('Counting %d most frequent question words' %
args.tune_partial)
top_words = utils.top_question_words(
args, train_exs, model.word_dict
)
for word in top_words[:5]:
logger.info(word)
logger.info('...')
for word in top_words[-6:-1]:
logger.info(word)
model.tune_embeddings([w[0] for w in top_words])
# Set up optimizer
model.init_optimizer()
# Use the GPU?
if args.cuda:
model.cuda()
# Use multiple GPUs?
if args.parallel:
model.parallelize()
# --------------------------------------------------------------------------
# DATA ITERATORS
# Two datasets: train and dev. If we sort by length it's faster.
logger.info('-' * 100)
logger.info('Make data loaders')
train_dataset = data.ReaderDataset(train_exs, model, single_answer=True)
if args.sort_by_len:
train_sampler = data.SortedBatchSampler(train_dataset.lengths(),
args.batch_size,
shuffle=True)
else:
train_sampler = torch.utils.data.sampler.RandomSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
train_dataset,
batch_size=args.batch_size,
sampler=train_sampler,
num_workers=args.data_workers,
collate_fn=vector.batchify,
pin_memory=args.cuda,
)
dev_dataset = data.ReaderDataset(dev_exs, model, single_answer=False)
if args.sort_by_len:
dev_sampler = data.SortedBatchSampler(dev_dataset.lengths(),
args.test_batch_size,
shuffle=False)
else:
dev_sampler = torch.utils.data.sampler.SequentialSampler(dev_dataset)
dev_loader = torch.utils.data.DataLoader(
dev_dataset,
batch_size=args.test_batch_size,
sampler=dev_sampler,
num_workers=args.data_workers,
collate_fn=vector.batchify,
pin_memory=args.cuda,
)
# -------------------------------------------------------------------------
# PRINT CONFIG
logger.info('-' * 100)
logger.info('CONFIG:\n%s' %
json.dumps(vars(args), indent=4, sort_keys=True))
# --------------------------------------------------------------------------
# TRAIN/VALID LOOP
logger.info('-' * 100)
logger.info('Starting training...')
stats = {'timer': utils.Timer(), 'epoch': 0, 'best_valid': 0}
for epoch in range(start_epoch, args.num_epochs):
stats['epoch'] = epoch
# Train
train(args, train_loader, model, stats)
# Validate unofficial (train)
validate_unofficial(args, train_loader, model, stats, mode='train')
# Validate unofficial (dev)
result = validate_unofficial(args, dev_loader, model, stats, mode='dev')
# Validate official
if args.official_eval:
result = validate_official(args, dev_loader, model, stats,
dev_offsets, dev_texts, dev_answers)
# Save best valid
if args.valid_metric is None or args.valid_metric == 'None':
model.save(args.model_file)
elif result[args.valid_metric] > stats['best_valid']:
logger.info('Best valid: %s = %.2f (epoch %d, %d updates)' %
(args.valid_metric, result[args.valid_metric],
stats['epoch'], model.updates))
model.save(args.model_file)
stats['best_valid'] = result[args.valid_metric]
if __name__ == '__main__':
# Parse cmdline args and setup environment
parser = argparse.ArgumentParser(
'WRMCQA Document Reader',
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
add_train_args(parser)
config.add_model_args(parser)
args = parser.parse_args()
set_defaults(args)
# Set cuda
args.cuda = not args.no_cuda and torch.cuda.is_available()
if args.cuda:
torch.cuda.set_device(args.gpu)
# Set random state
np.random.seed(args.random_seed)
torch.manual_seed(args.random_seed)
if args.cuda:
torch.cuda.manual_seed(args.random_seed)
# Set logging
logger.setLevel(logging.INFO)
fmt = logging.Formatter('%(asctime)s: [ %(message)s ]',
'%m/%d/%Y %I:%M:%S %p')
console = logging.StreamHandler()
console.setFormatter(fmt)
logger.addHandler(console)
if args.log_file:
if args.checkpoint:
logfile = logging.FileHandler(args.log_file, 'a')
else:
logfile = logging.FileHandler(args.log_file, 'w')
logfile.setFormatter(fmt)
logger.addHandler(logfile)
logger.info('COMMAND: %s' % ' '.join(sys.argv))
print(args)
# Run!
main(args)
================================================
FILE: spacy_tokenizer.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Tokenizer that is backed by spaCy (spacy.io).
Requires spaCy package and the spaCy english model.
"""
import spacy
import copy
class Tokens(object):
"""A class to represent a list of tokenized text."""
TEXT = 0
CHAR = 1
TEXT_WS = 2
SPAN = 3
POS = 4
LEMMA = 5
NER = 6
def __init__(self, data, annotators, opts=None):
self.data = data
self.annotators = annotators
self.opts = opts or {}
def __len__(self):
"""The number of tokens."""
return len(self.data)
def slice(self, i=None, j=None):
"""Return a view of the list of tokens from [i, j)."""
new_tokens = copy.copy(self)
new_tokens.data = self.data[i: j]
return new_tokens
def untokenize(self):
"""Returns the original text (with whitespace reinserted)."""
return ''.join([t[self.TEXT_WS] for t in self.data]).strip()
def chars(self, uncased=False):
"""Returns a list of the first character of each token
Args:
uncased: lower cases characters
"""
if uncased:
return [[c.lower() for c in t[self.CHAR]] for t in self.data]
else:
return [[c for c in t[self.CHAR]] for t in self.data]
def words(self, uncased=False):
"""Returns a list of the text of each token
Args:
uncased: lower cases text
"""
if uncased:
return [t[self.TEXT].lower() for t in self.data]
else:
return [t[self.TEXT] for t in self.data]
def offsets(self):
"""Returns a list of [start, end) character offsets of each token."""
return [t[self.SPAN] for t in self.data]
def pos(self):
"""Returns a list of part-of-speech tags of each token.
Returns None if this annotation was not included.
"""
if 'pos' not in self.annotators:
return None
return [t[self.POS] for t in self.data]
def lemmas(self):
"""Returns a list of the lemmatized text of each token.
Returns None if this annotation was not included.
"""
if 'lemma' not in self.annotators:
return None
return [t[self.LEMMA] for t in self.data]
def entities(self):
"""Returns a list of named-entity-recognition tags of each token.
Returns None if this annotation was not included.
"""
if 'ner' not in self.annotators:
return None
return [t[self.NER] for t in self.data]
def ngrams(self, n=1, uncased=False, filter_fn=None, as_strings=True):
"""Returns a list of all ngrams from length 1 to n.
Args:
n: upper limit of ngram length
uncased: lower cases text
filter_fn: user function that takes in an ngram list and returns
True or False to keep or not keep the ngram
as_string: return the ngram as a string vs list
"""
def _skip(gram):
if not filter_fn:
return False
return filter_fn(gram)
words = self.words(uncased)
ngrams = [(s, e + 1)
for s in range(len(words))
for e in range(s, min(s + n, len(words)))
if not _skip(words[s:e + 1])]
# Concatenate into strings
if as_strings:
ngrams = ['{}'.format(' '.join(words[s:e])) for (s, e) in ngrams]
return ngrams
def entity_groups(self):
"""Group consecutive entity tokens with the same NER tag."""
entities = self.entities()
if not entities:
return None
non_ent = self.opts.get('non_ent', 'O')
groups = []
idx = 0
while idx < len(entities):
ner_tag = entities[idx]
# Check for entity tag
if ner_tag != non_ent:
# Chomp the sequence
start = idx
while (idx < len(entities) and entities[idx] == ner_tag):
idx += 1
groups.append((self.slice(start, idx).untokenize(), ner_tag))
else:
idx += 1
return groups
class SpacyTokenizer(object):
def __init__(self, **kwargs):
"""
Args:
annotators: set that can include pos, lemma, and ner.
model: spaCy model to use (either path, or keyword like 'en').
"""
model = kwargs.get('model', 'en')
self.annotators = copy.deepcopy(kwargs.get('annotators', set()))
self.nlp = spacy.load(model)
self.nlp.remove_pipe('parser')
if not any([p in self.annotators for p in ['lemma', 'pos', 'ner']]):
self.nlp.remove_pipe('tagger')
if 'ner' not in self.annotators:
self.nlp.remove_pipe('ner')
def tokenize(self, text):
# We don't treat new lines as tokens.
clean_text = text.replace('\n', ' ')
tokens = self.nlp(clean_text)
data = []
for i in range(len(tokens)):
# Get whitespace
start_ws = tokens[i].idx
if i + 1 < len(tokens):
end_ws = tokens[i + 1].idx
else:
end_ws = tokens[i].idx + len(tokens[i].text)
data.append((
tokens[i].text,
list(tokens[i].text),
text[start_ws: end_ws],
(tokens[i].idx, tokens[i].idx + len(tokens[i].text)),
tokens[i].tag_,
tokens[i].lemma_,
tokens[i].ent_type_,
))
# Set special option for non-entity tag: '' vs 'O' in spaCy
return Tokens(data, self.annotators, opts={'non_ent': ''})
def shutdown(self):
pass
def __del__(self):
self.shutdown()
================================================
FILE: utils.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Reader utilities."""
try:
import ujson as json
except ImportError:
import json
import time
import logging
import string
try:
import regex as re
except ImportError:
import re
from collections import Counter
from data import Dictionary
logger = logging.getLogger(__name__)
# ------------------------------------------------------------------------------
# Data loading
# ------------------------------------------------------------------------------
def load_data(args, filename, skip_no_answer=False):
"""Load examples from preprocessed file.
One example per line, JSON encoded.
"""
# Load JSON lines
with open(filename) as f:
examples = [json.loads(line) for line in f]
# Make case insensitive?
if args.uncased_question or args.uncased_doc:
for ex in examples:
if args.uncased_question:
ex['question'] = [w.lower() for w in ex['question']]
ex['question_char'] = [w.lower() for w in ex['question_char']]
if args.uncased_doc:
ex['document'] = [w.lower() for w in ex['document']]
ex['document_char'] = [w.lower() for w in ex['document_char']]
# Skip unparsed (start/end) examples
if skip_no_answer:
examples = [ex for ex in examples if len(ex['answers']) > 0]
return examples
def load_text(filename):
"""Load the paragraphs only of a SQuAD dataset. Store as qid -> text."""
# Load JSON file
with open(filename) as f:
examples = json.load(f)['data']
texts = {}
for article in examples:
for paragraph in article['paragraphs']:
for qa in paragraph['qas']:
texts[qa['id']] = paragraph['context']
return texts
def load_answers(filename):
"""Load the answers only of a SQuAD dataset. Store as qid -> [answers]."""
# Load JSON file
with open(filename) as f:
examples = json.load(f)['data']
ans = {}
for article in examples:
for paragraph in article['paragraphs']:
for qa in paragraph['qas']:
ans[qa['id']] = list(map(lambda x: x['text'], qa['answers']))
return ans
# ------------------------------------------------------------------------------
# Dictionary building
# ------------------------------------------------------------------------------
def index_embedding_words(embedding_file):
"""Put all the words in embedding_file into a set."""
words = set()
with open(embedding_file) as f:
for line in f:
w = Dictionary.normalize(line.rstrip().split(' ')[0])
words.add(w)
return words
def load_words(args, examples):
"""Iterate and index all the words in examples (documents + questions)."""
def _insert(iterable):
for w in iterable:
w = Dictionary.normalize(w)
if valid_words and w not in valid_words:
continue
words.add(w)
if args.restrict_vocab and args.embedding_file:
logger.info('Restricting to words in %s' % args.embedding_file)
valid_words = index_embedding_words(args.embedding_file)
logger.info('Num words in set = %d' % len(valid_words))
else:
valid_words = None
words = set()
for ex in examples:
_insert(ex['question'])
_insert(ex['document'])
return words
def build_word_dict(args, examples):
"""Return a word dictionary from question and document words in
provided examples.
"""
word_dict = Dictionary()
for w in load_words(args, examples):
word_dict.add(w)
return word_dict
def index_embedding_chars(char_embedding_file):
"""Put all the chars in char_embedding_file into a set."""
chars = set()
with open(char_embedding_file) as f:
for line in f:
c = Dictionary.normalize(line.rstrip().split(' ')[0])
chars.add(c)
return chars
def load_chars(args, examples):
"""Iterate and index all the chars in examples (documents + questions)."""
def _insert(iterable):
for cs in iterable:
for c in cs:
c = Dictionary.normalize(c)
if valid_chars and c not in valid_chars:
continue
chars.add(c)
if args.restrict_vocab and args.char_embedding_file:
logger.info('Restricting to chars in %s' % args.char_embedding_file)
valid_chars = index_embedding_chars(args.char_embedding_file)
logger.info('Num chars in set = %d' % len(valid_chars))
else:
valid_chars = None
chars = set()
for ex in examples:
_insert(ex['question_char'])
_insert(ex['document_char'])
return chars
def build_char_dict(args, examples):
"""Return a char dictionary from question and document words in
provided examples.
"""
char_dict = Dictionary()
for c in load_chars(args, examples):
char_dict.add(c)
return char_dict
def top_question_words(args, examples, word_dict):
"""Count and return the most common question words in provided examples."""
word_count = Counter()
for ex in examples:
for w in ex['question']:
w = Dictionary.normalize(w)
if w in word_dict:
word_count.update([w])
return word_count.most_common(args.tune_partial)
def build_feature_dict(args, examples):
"""Index features (one hot) from fields in examples and options."""
def _insert(feature):
if feature not in feature_dict:
feature_dict[feature] = len(feature_dict)
feature_dict = {}
# Exact match features
if args.use_exact_match:
_insert('in_cased')
_insert('in_uncased')
if args.use_lemma:
_insert('in_lemma')
# Part of speech tag features
if args.use_pos:
for ex in examples:
for w in ex['cpos']:
_insert('pos=%s' % w)
for w in ex['qpos']:
_insert('pos=%s' % w)
# Named entity tag features
if args.use_ner:
for ex in examples:
for w in ex['cner']:
_insert('ner=%s' % w)
for w in ex['qner']:
_insert('ner=%s' % w)
# Term frequency feature
if args.use_tf:
_insert('tf')
return feature_dict
# ------------------------------------------------------------------------------
# Evaluation. Follows official evalutation script for v1.1 of the SQuAD dataset.
# ------------------------------------------------------------------------------
def normalize_answer(s):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(text):
return re.sub(r'\b(a|an|the)\b', ' ', text)
def white_space_fix(text):
return ' '.join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return ''.join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
def f1_score(prediction, ground_truth):
"""Compute the geometric mean of precision and recall for answer tokens."""
prediction_tokens = normalize_answer(prediction).split()
ground_truth_tokens = normalize_answer(ground_truth).split()
common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
num_same = sum(common.values())
if num_same == 0:
return 0
precision = 1.0 * num_same / len(prediction_tokens)
recall = 1.0 * num_same / len(ground_truth_tokens)
f1 = (2 * precision * recall) / (precision + recall)
return f1
def exact_match_score(prediction, ground_truth):
"""Check if the prediction is a (soft) exact match with the ground truth."""
return normalize_answer(prediction) == normalize_answer(ground_truth)
def regex_match_score(prediction, pattern):
"""Check if the prediction matches the given regular expression."""
try:
compiled = re.compile(
pattern,
flags=re.IGNORECASE + re.UNICODE + re.MULTILINE
)
except BaseException:
logger.warn('Regular expression failed to compile: %s' % pattern)
return False
return compiled.match(prediction) is not None
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
"""Given a prediction and multiple valid answers, return the score of
the best prediction-answer_n pair given a metric function.
"""
scores_for_ground_truths = []
for ground_truth in ground_truths:
score = metric_fn(prediction, ground_truth)
scores_for_ground_truths.append(score)
return max(scores_for_ground_truths)
# ------------------------------------------------------------------------------
# Utility classes
# ------------------------------------------------------------------------------
class AverageMeter(object):
"""Computes and stores the average and current value."""
def __init__(self):
self.reset()
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
def update(self, val, n=1):
self.val = val
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
class Timer(object):
"""Computes elapsed time."""
def __init__(self):
self.running = True
self.total = 0
self.start = time.time()
def reset(self):
self.running = True
self.total = 0
self.start = time.time()
return self
def resume(self):
if not self.running:
self.running = True
self.start = time.time()
return self
def stop(self):
if self.running:
self.running = False
self.total += time.time() - self.start
return self
def time(self):
if self.running:
return self.total + time.time() - self.start
return self.total
================================================
FILE: vector.py
================================================
#!/usr/bin/env python3
# Copyright 2018-present, HKUST-KnowComp.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Functions for putting examples into torch format."""
from collections import Counter
import torch
def vectorize(ex, model, single_answer=False):
"""Torchify a single example."""
args = model.args
word_dict = model.word_dict
char_dict = model.char_dict
feature_dict = model.feature_dict
# Index words
document = torch.LongTensor([word_dict[w] for w in ex['document']])
document_char = [torch.LongTensor([char_dict[c] for c in cs]) for cs in ex['document_char']]
question = torch.LongTensor([word_dict[w] for w in ex['question']])
question_char = [torch.LongTensor([char_dict[c] for c in cs]) for cs in ex['question_char']]
# Create extra features vector
if len(feature_dict) > 0:
c_features = torch.zeros(len(ex['document']), len(feature_dict))
q_features = torch.zeros(len(ex['question']), len(feature_dict))
else:
c_features = None
q_features = None
# f_{exact_match}
if args.use_exact_match:
q_words_cased = {w for w in ex['question']}
q_words_uncased = {w.lower() for w in ex['question']}
q_lemma = {w for w in ex['qlemma']} if args.use_lemma else None
for i in range(len(ex['document'])):
if ex['document'][i] in q_words_cased:
c_features[i][feature_dict['in_cased']] = 1.0
if ex['document'][i].lower() in q_words_uncased:
c_features[i][feature_dict['in_uncased']] = 1.0
if q_lemma and ex['clemma'][i] in q_lemma:
c_features[i][feature_dict['in_lemma']] = 1.0
c_words_cased = {w for w in ex['document']}
c_words_uncased = {w.lower() for w in ex['document']}
c_lemma = {w for w in ex['clemma']} if args.use_lemma else None
for i in range(len(ex['question'])):
if ex['question'][i] in c_words_cased:
q_features[i][feature_dict['in_cased']] = 1.0
if ex['question'][i].lower() in c_words_uncased:
q_features[i][feature_dict['in_uncased']] = 1.0
if c_lemma and ex['qlemma'][i] in c_lemma:
q_features[i][feature_dict['in_lemma']] = 1.0
# f_{token} (POS)
if args.use_pos:
for i, w in enumerate(ex['cpos']):
f = 'pos=%s' % w
if f in feature_dict:
c_features[i][feature_dict[f]] = 1.0
for i, w in enumerate(ex['qpos']):
f = 'pos=%s' % w
if f in feature_dict:
q_features[i][feature_dict[f]] = 1.0
# f_{token} (NER)
if args.use_ner:
for i, w in enumerate(ex['cner']):
f = 'ner=%s' % w
if f in feature_dict:
c_features[i][feature_dict[f]] = 1.0
for i, w in enumerate(ex['qner']):
f = 'ner=%s' % w
if f in feature_dict:
q_features[i][feature_dict[f]] = 1.0
# f_{token} (TF)
if args.use_tf:
counter = Counter([w.lower() for w in ex['document']])
l = len(ex['document'])
for i, w in enumerate(ex['document']):
c_features[i][feature_dict['tf']] = counter[w.lower()] * 1.0 / l
counter = Counter([w.lower() for w in ex['question']])
l = len(ex['question'])
for i, w in enumerate(ex['question']):
q_features[i][feature_dict['tf']] = counter[w.lower()] * 1.0 / l
# Maybe return without target
if 'answers' not in ex:
return document, document_char, c_features, question, question_char, q_features, ex['id']
# ...or with target(s) (might still be empty if answers is empty)
if single_answer:
assert(len(ex['answers']) > 0)
start = torch.LongTensor(1).fill_(ex['answers'][0][0])
end = torch.LongTensor(1).fill_(ex['answers'][0][1])
else:
start = [a[0] for a in ex['answers']]
end = [a[1] for a in ex['answers']]
return document, document_char, c_features, question, question_char, q_features, start, end, ex['id']
def batchify(batch):
"""Gather a batch of individual examples into one batch."""
NUM_INPUTS = 6
NUM_TARGETS = 2
NUM_EXTRA = 1
docs = [ex[0] for ex in batch]
doc_chars = [ex[1] for ex in batch]
c_features = [ex[2] for ex in batch]
questions = [ex[3] for ex in batch]
question_chars = [ex[4] for ex in batch]
q_features = [ex[5] for ex in batch]
ids = [ex[-1] for ex in batch]
# Batch documents and features
max_length = max([d.size(0) for d in docs])
# max_char_length = max([c.size(0) for cs in doc_chars for c in cs])
max_char_length = 13
x1 = torch.LongTensor(len(docs), max_length).zero_()
x1_c = torch.LongTensor(len(docs), max_length, max_char_length).zero_()
x1_mask = torch.ByteTensor(len(docs), max_length).fill_(1)
if c_features[0] is None:
x1_f = None
else:
x1_f = torch.zeros(len(docs), max_length, c_features[0].size(1))
for i, d in enumerate(docs):
x1[i, :d.size(0)].copy_(d)
x1_mask[i, :d.size(0)].fill_(0)
if x1_f is not None:
x1_f[i, :d.size(0)].copy_(c_features[i])
for i, cs in enumerate(doc_chars):
for j, c in enumerate(cs):
c_ = c[:max_char_length]
x1_c[i, j, :c_.size(0)].copy_(c_)
# Batch questions
max_length = max([q.size(0) for q in questions])
x2 = torch.LongTensor(len(questions), max_length).zero_()
x2_c = torch.LongTensor(len(questions), max_length, max_char_length).zero_()
x2_mask = torch.ByteTensor(len(questions), max_length).fill_(1)
if q_features[0] is None:
x2_f = None
else:
x2_f = torch.zeros(len(questions), max_length, q_features[0].size(1))
for i, d in enumerate(questions):
x2[i, :d.size(0)].copy_(d)
x2_mask[i, :d.size(0)].fill_(0)
if x2_f is not None:
x2_f[i, :d.size(0)].copy_(q_features[i])
for i, cs in enumerate(question_chars):
for j, c in enumerate(cs):
c_ = c[:max_char_length]
x2_c[i, j, :c_.size(0)].copy_(c_)
# Maybe return without targets
if len(batch[0]) == NUM_INPUTS + NUM_EXTRA:
return x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask, ids
elif len(batch[0]) == NUM_INPUTS + NUM_EXTRA + NUM_TARGETS:
# ...Otherwise add targets
if torch.is_tensor(batch[0][NUM_INPUTS]):
y_s = torch.cat([ex[NUM_INPUTS] for ex in batch])
y_e = torch.cat([ex[NUM_INPUTS+1] for ex in batch])
else:
y_s = [ex[NUM_INPUTS] for ex in batch]
y_e = [ex[NUM_INPUTS+1] for ex in batch]
else:
raise RuntimeError('Incorrect number of inputs per example.')
return x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask, y_s, y_e, ids
gitextract_qftjbr90/ ├── .gitignore ├── LICENSE ├── README.md ├── config.py ├── data.py ├── layers.py ├── m_reader.py ├── model.py ├── predictor.py ├── r_net.py ├── rnn_reader.py ├── script/ │ ├── evaluate-v1.1.py │ ├── interactive.py │ ├── predict.py │ ├── preprocess.py │ └── train.py ├── spacy_tokenizer.py ├── utils.py └── vector.py
SYMBOL INDEX (167 symbols across 15 files)
FILE: config.py
function str2bool (line 29) | def str2bool(v):
function add_model_args (line 33) | def add_model_args(parser):
function get_model_args (line 108) | def get_model_args(args):
function override_model_args (line 120) | def override_model_args(old_args, new_args):
FILE: data.py
class Dictionary (line 25) | class Dictionary(object):
method normalize (line 31) | def normalize(token):
method __init__ (line 34) | def __init__(self):
method __len__ (line 38) | def __len__(self):
method __iter__ (line 41) | def __iter__(self):
method __contains__ (line 44) | def __contains__(self, key):
method __getitem__ (line 50) | def __getitem__(self, key):
method __setitem__ (line 57) | def __setitem__(self, key, item):
method add (line 65) | def add(self, token):
method tokens (line 72) | def tokens(self):
class ReaderDataset (line 88) | class ReaderDataset(Dataset):
method __init__ (line 90) | def __init__(self, examples, model, single_answer=False):
method __len__ (line 95) | def __len__(self):
method __getitem__ (line 98) | def __getitem__(self, index):
method lengths (line 101) | def lengths(self):
class SortedBatchSampler (line 111) | class SortedBatchSampler(Sampler):
method __init__ (line 113) | def __init__(self, lengths, batch_size, shuffle=True):
method __iter__ (line 118) | def __iter__(self):
method __len__ (line 130) | def __len__(self):
FILE: layers.py
class StackedBRNN (line 22) | class StackedBRNN(nn.Module):
method __init__ (line 30) | def __init__(self, input_size, hidden_size, num_layers,
method forward (line 46) | def forward(self, x, x_mask):
method _forward_unpadded (line 70) | def _forward_unpadded(self, x, x_mask):
method _forward_padded (line 105) | def _forward_padded(self, x, x_mask):
class FeedForwardNetwork (line 170) | class FeedForwardNetwork(nn.Module):
method __init__ (line 171) | def __init__(self, input_size, hidden_size, output_size, dropout_rate=0):
method forward (line 177) | def forward(self, x):
class PointerNetwork (line 183) | class PointerNetwork(nn.Module):
method __init__ (line 184) | def __init__(self, x_size, y_size, hidden_size, dropout_rate=0, cell_t...
method init_hiddens (line 194) | def init_hiddens(self, y, y_mask):
method pointer (line 199) | def pointer(self, x, state, x_mask):
method forward (line 218) | def forward(self, x, y, x_mask, y_mask):
class MemoryAnsPointer (line 226) | class MemoryAnsPointer(nn.Module):
method __init__ (line 227) | def __init__(self, x_size, y_size, hidden_size, hop=1, dropout_rate=0,...
method forward (line 243) | def forward(self, x, y, x_mask, y_mask):
class SeqAttnMatch (line 283) | class SeqAttnMatch(nn.Module):
method __init__ (line 290) | def __init__(self, input_size, identity=False):
method forward (line 297) | def forward(self, x, y, y_mask):
class SelfAttnMatch (line 330) | class SelfAttnMatch(nn.Module):
method __init__ (line 337) | def __init__(self, input_size, identity=False, diag=True):
method forward (line 345) | def forward(self, x, x_mask):
class BilinearSeqAttn (line 379) | class BilinearSeqAttn(nn.Module):
method __init__ (line 387) | def __init__(self, x_size, y_size, identity=False, normalize=True):
method forward (line 397) | def forward(self, x, y, x_mask):
class LinearSeqAttn (line 421) | class LinearSeqAttn(nn.Module):
method __init__ (line 427) | def __init__(self, input_size):
method forward (line 431) | def forward(self, x, x_mask):
class NonLinearSeqAttn (line 445) | class NonLinearSeqAttn(nn.Module):
method __init__ (line 451) | def __init__(self, input_size, hidden_size):
method forward (line 455) | def forward(self, x, x_mask):
class Gate (line 473) | class Gate(nn.Module):
method __init__ (line 478) | def __init__(self, input_size):
method forward (line 482) | def forward(self, x):
class SFU (line 495) | class SFU(nn.Module):
method __init__ (line 500) | def __init__(self, input_size, fusion_size):
method forward (line 505) | def forward(self, x, fusions):
function uniform_weights (line 518) | def uniform_weights(x, x_mask):
function weighted_avg (line 535) | def weighted_avg(x, weights):
FILE: m_reader.py
class MnemonicReader (line 21) | class MnemonicReader(nn.Module):
method __init__ (line 24) | def __init__(self, args, normalize=True):
method forward (line 105) | def forward(self, x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask):
FILE: model.py
class DocReader (line 26) | class DocReader(object):
method __init__ (line 35) | def __init__(self, args, word_dict, char_dict, feature_dict,
method expand_dictionary (line 70) | def expand_dictionary(self, words):
method expand_char_dictionary (line 101) | def expand_char_dictionary(self, chars):
method load_embeddings (line 131) | def load_embeddings(self, words, embedding_file):
method load_char_embeddings (line 169) | def load_char_embeddings(self, chars, char_embedding_file):
method tune_embeddings (line 207) | def tune_embeddings(self, words):
method init_optimizer (line 248) | def init_optimizer(self, state_dict=None):
method update (line 277) | def update(self, ex):
method reset_parameters (line 318) | def reset_parameters(self):
method predict (line 338) | def predict(self, ex, candidates=None, top_n=1, async_pool=None):
method decode (line 389) | def decode(score_s, score_e, top_n=1, max_len=None):
method decode_candidates (line 427) | def decode_candidates(score_s, score_e, candidates, top_n=1, max_len=N...
method save (line 480) | def save(self, filename):
method checkpoint (line 496) | def checkpoint(self, filename, epoch):
method load (line 512) | def load(filename, new_args=None, normalize=True):
method load_checkpoint (line 531) | def load_checkpoint(filename, normalize=True):
method cuda (line 551) | def cuda(self):
method cpu (line 555) | def cpu(self):
method parallelize (line 559) | def parallelize(self):
FILE: predictor.py
function init (line 28) | def init(options):
function tokenize (line 34) | def tokenize(text):
function get_annotators_for_model (line 38) | def get_annotators_for_model(model):
class Predictor (line 54) | class Predictor(object):
method __init__ (line 57) | def __init__(self, model, normalize=True,
method predict (line 94) | def predict(self, document, question, candidates=None, top_n=1):
method predict_batch (line 99) | def predict_batch(self, batch, top_n=1):
method cuda (line 153) | def cuda(self):
method cpu (line 156) | def cpu(self):
FILE: r_net.py
class R_Net (line 21) | class R_Net(nn.Module):
method __init__ (line 24) | def __init__(self, args, normalize=True):
method forward (line 124) | def forward(self, x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask):
FILE: rnn_reader.py
class RnnDocReader (line 19) | class RnnDocReader(nn.Module):
method __init__ (line 22) | def __init__(self, args, normalize=True):
method forward (line 92) | def forward(self, x1, x1_c, x1_f, x1_mask, x2, x2_c, x2_f, x2_mask):
FILE: script/evaluate-v1.1.py
function normalize_answer (line 11) | def normalize_answer(s):
function f1_score (line 29) | def f1_score(prediction, ground_truth):
function exact_match_score (line 42) | def exact_match_score(prediction, ground_truth):
function metric_max_over_ground_truths (line 46) | def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
function evaluate (line 54) | def evaluate(dataset, predictions):
FILE: script/interactive.py
function process (line 35) | def process(document, question, candidates=None, top_n=1):
function usage (line 57) | def usage():
FILE: script/preprocess.py
function init (line 32) | def init():
function tokenize (line 38) | def tokenize(text):
function load_dataset (line 58) | def load_dataset(path):
function find_answer (line 76) | def find_answer(offsets, begin_offset, end_offset):
function process_dataset (line 86) | def process_dataset(data, tokenizer, workers=None):
FILE: script/train.py
function str2bool (line 39) | def str2bool(v):
function add_train_args (line 43) | def add_train_args(parser):
function set_defaults (line 125) | def set_defaults(args):
function init_from_scratch (line 194) | def init_from_scratch(args, train_exs, dev_exs):
function train (line 231) | def train(args, data_loader, model, global_stats):
function validate_unofficial (line 263) | def validate_unofficial(args, data_loader, model, global_stats, mode):
function validate_official (line 299) | def validate_official(args, data_loader, model, global_stats,
function eval_accuracies (line 341) | def eval_accuracies(pred_s, target_s, pred_e, target_e):
function main (line 382) | def main(args):
FILE: spacy_tokenizer.py
class Tokens (line 15) | class Tokens(object):
method __init__ (line 25) | def __init__(self, data, annotators, opts=None):
method __len__ (line 30) | def __len__(self):
method slice (line 34) | def slice(self, i=None, j=None):
method untokenize (line 40) | def untokenize(self):
method chars (line 44) | def chars(self, uncased=False):
method words (line 55) | def words(self, uncased=False):
method offsets (line 66) | def offsets(self):
method pos (line 70) | def pos(self):
method lemmas (line 78) | def lemmas(self):
method entities (line 86) | def entities(self):
method ngrams (line 94) | def ngrams(self, n=1, uncased=False, filter_fn=None, as_strings=True):
method entity_groups (line 121) | def entity_groups(self):
class SpacyTokenizer (line 143) | class SpacyTokenizer(object):
method __init__ (line 145) | def __init__(self, **kwargs):
method tokenize (line 161) | def tokenize(self, text):
method shutdown (line 188) | def shutdown(self):
method __del__ (line 191) | def __del__(self):
FILE: utils.py
function load_data (line 33) | def load_data(args, filename, skip_no_answer=False):
function load_text (line 57) | def load_text(filename):
function load_answers (line 71) | def load_answers(filename):
function index_embedding_words (line 90) | def index_embedding_words(embedding_file):
function load_words (line 100) | def load_words(args, examples):
function build_word_dict (line 123) | def build_word_dict(args, examples):
function index_embedding_chars (line 132) | def index_embedding_chars(char_embedding_file):
function load_chars (line 141) | def load_chars(args, examples):
function build_char_dict (line 164) | def build_char_dict(args, examples):
function top_question_words (line 173) | def top_question_words(args, examples, word_dict):
function build_feature_dict (line 184) | def build_feature_dict(args, examples):
function normalize_answer (line 227) | def normalize_answer(s):
function f1_score (line 245) | def f1_score(prediction, ground_truth):
function exact_match_score (line 259) | def exact_match_score(prediction, ground_truth):
function regex_match_score (line 264) | def regex_match_score(prediction, pattern):
function metric_max_over_ground_truths (line 277) | def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
class AverageMeter (line 293) | class AverageMeter(object):
method __init__ (line 296) | def __init__(self):
method reset (line 299) | def reset(self):
method update (line 305) | def update(self, val, n=1):
class Timer (line 312) | class Timer(object):
method __init__ (line 315) | def __init__(self):
method reset (line 320) | def reset(self):
method resume (line 326) | def resume(self):
method stop (line 332) | def stop(self):
method time (line 338) | def time(self):
FILE: vector.py
function vectorize (line 13) | def vectorize(ex, model, single_answer=False):
function batchify (line 107) | def batchify(batch):
Condensed preview — 19 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (153K chars).
[
{
"path": ".gitignore",
"chars": 46,
"preview": "*.pyc\n*.DS_Store\n*~\ndata/\n*.tar.gz\n*.egg-info\n"
},
{
"path": "LICENSE",
"chars": 1514,
"preview": "BSD 3-Clause License\n\nCopyright (c) 2018, HKUST-KnowComp\nAll rights reserved.\n\nRedistribution and use in source and bina"
},
{
"path": "README.md",
"chars": 5525,
"preview": "# Mnemonic Reader\nThe Mnemonic Reader is a deep learning model for Machine Comprehension task. You can get details from "
},
{
"path": "config.py",
"chars": 6749,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "data.py",
"chars": 4043,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "layers.py",
"chars": 18422,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "m_reader.py",
"chars": 7094,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "model.py",
"chars": 22418,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "predictor.py",
"chars": 5615,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "r_net.py",
"chars": 7666,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "rnn_reader.py",
"chars": 5291,
"preview": "#!/usr/bin/env python3\n# Copyright 2017-present, Facebook, Inc.\n# All rights reserved.\n#\n# This source code is licensed "
},
{
"path": "script/evaluate-v1.1.py",
"chars": 3419,
"preview": "\"\"\" Official evaluation script for v1.1 of the SQuAD dataset. \"\"\"\nfrom __future__ import print_function\nfrom collections"
},
{
"path": "script/interactive.py",
"chars": 3416,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "script/predict.py",
"chars": 4266,
"preview": "#!/usr/bin/env python3\n# Copyright 2017-present, Facebook, Inc.\n# All rights reserved.\n#\n# This source code is licensed "
},
{
"path": "script/preprocess.py",
"chars": 5562,
"preview": "#!/usr/bin/env python3\n# Copyright 2017-present, Facebook, Inc.\n# All rights reserved.\n#\n# This source code is licensed "
},
{
"path": "script/train.py",
"chars": 23364,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "spacy_tokenizer.py",
"chars": 6039,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "utils.py",
"chars": 10219,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
},
{
"path": "vector.py",
"chars": 6975,
"preview": "#!/usr/bin/env python3\n# Copyright 2018-present, HKUST-KnowComp.\n# All rights reserved.\n#\n# This source code is licensed"
}
]
About this extraction
This page contains the full source code of the HKUST-KnowComp/MnemonicReader GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 19 files (144.2 KB), approximately 33.5k tokens, and a symbol index with 167 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.