Repository: zhpmatrix/BERTem
Branch: master
Commit: 5151c4c304d1
Files: 65
Total size: 1.4 MB
Directory structure:
gitextract_mjpetdbh/
├── LICENSE
├── MANIFEST.in
├── README.md
├── docker/
│ └── Dockerfile
├── examples/
│ ├── bertology.py
│ ├── extract_features.py
│ ├── lm_finetuning/
│ │ ├── README.md
│ │ ├── finetune_on_pregenerated.py
│ │ ├── pregenerate_training_data.py
│ │ └── simple_lm_finetuning.py
│ ├── run_classifier.py
│ ├── run_classifier_dataset_utils.py
│ ├── run_gpt2.py
│ ├── run_openai_gpt.py
│ ├── run_squad.py
│ ├── run_squad_dataset_utils.py
│ ├── run_swag.py
│ ├── run_transfo_xl.py
│ ├── sem_run_classifier.py
│ ├── tacred_run_classifier.py
│ ├── tacred_run_infer.py
│ ├── test.sh
│ └── train.sh
├── hubconf.py
├── hubconfs/
│ ├── bert_hubconf.py
│ ├── gpt2_hubconf.py
│ ├── gpt_hubconf.py
│ └── transformer_xl_hubconf.py
├── notebooks/
│ ├── Comparing-PT-and-TF-models.ipynb
│ ├── Comparing-TF-and-PT-models-MLM-NSP.ipynb
│ ├── Comparing-TF-and-PT-models-SQuAD.ipynb
│ └── Comparing-TF-and-PT-models.ipynb
├── pytorch_pretrained_bert/
│ ├── __init__.py
│ ├── __main__.py
│ ├── convert_gpt2_checkpoint_to_pytorch.py
│ ├── convert_openai_checkpoint_to_pytorch.py
│ ├── convert_pytorch_checkpoint_to_tf.py
│ ├── convert_tf_checkpoint_to_pytorch.py
│ ├── convert_transfo_xl_checkpoint_to_pytorch.py
│ ├── file_utils.py
│ ├── modeling.py
│ ├── modeling_gpt2.py
│ ├── modeling_openai.py
│ ├── modeling_transfo_xl.py
│ ├── modeling_transfo_xl_utilities.py
│ ├── optimization.py
│ ├── optimization_openai.py
│ ├── tokenization.py
│ ├── tokenization_gpt2.py
│ ├── tokenization_openai.py
│ └── tokenization_transfo_xl.py
├── requirements.txt
├── samples/
│ ├── input.txt
│ └── sample_text.txt
├── setup.py
└── tests/
├── conftest.py
├── modeling_gpt2_test.py
├── modeling_openai_test.py
├── modeling_test.py
├── modeling_transfo_xl_test.py
├── optimization_test.py
├── tokenization_gpt2_test.py
├── tokenization_openai_test.py
├── tokenization_test.py
└── tokenization_transfo_xl_test.py
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: MANIFEST.in
================================================
include LICENSE
================================================
FILE: README.md
================================================
### Implementation Notes
This repo mainly implements the first half of the paper, in PyTorch, building on [huggingface](https://github.com/huggingface/pytorch-pretrained-BERT)'s work. PyTorch really is the best framework in the world (running away now).
### References

### Code Notes
(1) Main modifications: [modeling.py](https://github.com/zhpmatrix/BERTem/blob/master/pytorch_pretrained_bert/modeling.py)
output representation: **BertForSequenceClassification**
input representation: **BertEmbeddings**
Multiple strategies are implemented for both the input and the output, so you can search for the combination that works best on your specific task.
(2) Secondary changes: the classification-related files under examples/
(3) Serving: Flask-based; a service can be started locally. The implementation is in [tacred\_run\_infer.py](https://github.com/zhpmatrix/BERTem/blob/master/examples/tacred_run_infer.py).
(4) The code is for reference only. No dataset, no pretrained model, and no fine-tuned model are provided (I hope you understand).
(5) For related work, see [my blog post on neural relation extraction](https://zhpmatrix.github.io/2019/06/30/neural-relation-extraction/), which is probably more valuable than this code.
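Item (3) above describes a Flask-based local inference service; as a rough sketch of the same request/response shape, here is a standard-library `http.server` version (Flask is deliberately swapped out for illustration, and `predict_relation` is a hypothetical stand-in for the real model call in tacred_run_infer.py):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_relation(sentence, subj, obj):
    # Hypothetical stand-in for the fine-tuned BERT model loaded in
    # tacred_run_infer.py; a real implementation would tokenize, run the
    # model, and map the argmax logit to a TACRED relation label.
    return {"sentence": sentence, "subj": subj, "obj": obj,
            "relation": "no_relation"}

class InferHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON body like {"sentence": ..., "subj": ..., "obj": ...}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = predict_relation(payload.get("sentence", ""),
                                  payload.get("subj", ""),
                                  payload.get("obj", ""))
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve locally (blocks forever):
# HTTPServer(("127.0.0.1", 8000), InferHandler).serve_forever()
```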
### Results
Results on TACRED:
|Model #|Input type|Output type|Metric|P|R|F1|Notes|
|------|------|------|------|------|------|------|------|
|0|entity marker|sum(entity start)|micro|**0.68**|**0.63**|**0.65**|**base-model**, lr=3e-5, epoch=3|
||||macro|**0.60**|**0.54**|**0.55**||
|1|entity marker|sum(entity start)|micro|**0.70**|**0.62**|**0.65**|**large-model**, lr=3e-5, epoch=1|
||||macro|**0.63**|**0.52**|**0.55**||
|-1|None|None|micro|**0.69**|**0.66**|**0.67**|lost after a careless slip and never reproduced, embarrassingly|
||||macro|**0.58**|**0.50**|**0.53**||
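The micro and macro P/R/F1 rows above aggregate per-class counts in two different ways: micro pools true/false positives and false negatives across classes before scoring, while macro averages the per-class scores. A minimal sketch (hypothetical helper names; TACRED-style exclusion of the `no_relation` class is omitted):

```python
def prf(tp, fp, fn):
    # Precision, recall, F1 from raw counts, guarding against zero division.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro(counts):
    """counts: {label: (tp, fp, fn)} -> ((micro P, R, F1), (macro P, R, F1))."""
    # Micro: pool the counts across classes, then score once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)
    # Macro: score each class, then average the scores.
    per_class = [prf(*c) for c in counts.values()]
    n = len(per_class)
    macro = tuple(sum(x[i] for x in per_class) / n for i in range(3))
    return micro, macro
```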
Results on SemEval2010 Task 8:
|Model #|Input type|Output type|Metric|P|R|F1|Notes|
|------|------|------|------|------|------|------|------|
|0|entity marker|maxpool(entity emb)+relu|micro|**0.86**|**0.86**|**0.86**|bert-large|
||||macro|**0.82**|**0.83**|**0.82**||
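The "entity marker" input representation used in these experiments wraps the subject and object spans in special marker tokens before the sequence goes through BERT. A minimal sketch of the insertion step (the marker strings `[E1]`/`[/E1]`/`[E2]`/`[/E2]` are illustrative; the exact tokens in this repo's BertEmbeddings pipeline may differ):

```python
def add_entity_markers(tokens, subj_span, obj_span):
    """Insert entity marker tokens around the subject and object spans.

    subj_span / obj_span are (start, end) token indices, end exclusive.
    Assumes the two spans do not overlap.
    """
    inserts = [
        (subj_span[0], "[E1]"), (subj_span[1], "[/E1]"),
        (obj_span[0], "[E2]"), (obj_span[1], "[/E2]"),
    ]
    out = list(tokens)
    # Insert from the rightmost position first so earlier indices stay valid.
    for pos, marker in sorted(inserts, reverse=True):
        out.insert(pos, marker)
    return out
```

For the "sum(entity start)" output strategy in the TACRED table, the hidden states at the `[E1]` and `[E2]` positions would then be combined to form the relation representation.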
### Mixed-Precision Speedup Results
Keeping the earlier setting on this task, train and dev are merged into a new train set and the test set is left unchanged. Under the fp32 and fp16 settings, we compare the time per epoch (or per iteration) at the same batch\_size.
|Aspect|fp32|fp16|Notes|
|------|------|------|------|
|Training|1.04 it/s|4.41 it/s|12.76 it/s with exclusive use of the GPU|
|Inference|4.14 it/s|8.63 it/s||
|Test-set metrics|0.65/0.55|0.64/0.53|format: micro/macro|
|Model size|421M|212M||
================================================
FILE: docker/Dockerfile
================================================
FROM pytorch/pytorch:latest
RUN git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext
RUN pip install pytorch-pretrained-bert
WORKDIR /workspace
================================================
FILE: examples/bertology.py
================================================
#!/usr/bin/env python3
import os
import argparse
import logging
from datetime import timedelta, datetime
from tqdm import tqdm
import numpy as np
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset, Subset
from torch.utils.data.distributed import DistributedSampler
from torch.nn import CrossEntropyLoss, MSELoss
from pytorch_pretrained_bert import BertForSequenceClassification, BertTokenizer
from run_classifier_dataset_utils import processors, output_modes, convert_examples_to_features, compute_metrics
logger = logging.getLogger(__name__)
def entropy(p):
    plogp = p * torch.log(p)
    plogp[p == 0] = 0
    return -plogp.sum(dim=-1)

def print_1d_tensor(tensor, prefix=""):
    if tensor.dtype != torch.long:
        logger.info(prefix + "\t".join(f"{x:.5f}" for x in tensor.cpu().data))
    else:
        logger.info(prefix + "\t".join(f"{x:d}" for x in tensor.cpu().data))

def print_2d_tensor(tensor):
    logger.info("lv, h >\t" + "\t".join(f"{x + 1}" for x in range(len(tensor))))
    for row in range(len(tensor)):
        print_1d_tensor(tensor[row], prefix=f"layer {row + 1}:\t")
def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None):
    """ Example on how to use model outputs to compute:
        - head attention entropy (activated by setting output_attentions=True when we created the model)
        - head importance scores according to http://arxiv.org/abs/1905.10650
          (activated by setting keep_multihead_output=True when we created the model)
    """
    # Prepare our tensors
    n_layers, n_heads = model.bert.config.num_hidden_layers, model.bert.config.num_attention_heads
    head_importance = torch.zeros(n_layers, n_heads).to(args.device)
    attn_entropy = torch.zeros(n_layers, n_heads).to(args.device)
    preds = None
    labels = None
    tot_tokens = 0.0
    for step, batch in enumerate(tqdm(eval_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])):
        batch = tuple(t.to(args.device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch
        # Do a forward pass (not with torch.no_grad() since we need gradients for the importance score - see below)
        all_attentions, logits = model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask, head_mask=head_mask)
        if compute_entropy:
            # Update head attention entropy
            for layer, attn in enumerate(all_attentions):
                masked_entropy = entropy(attn.detach()) * input_mask.float().unsqueeze(1)
                attn_entropy[layer] += masked_entropy.sum(-1).sum(0).detach()
        if compute_importance:
            # Update head importance scores with regards to our loss
            # First, backpropagate to populate the gradients
            if args.output_mode == "classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, args.num_labels), label_ids.view(-1))
            elif args.output_mode == "regression":
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), label_ids.view(-1))
            loss.backward()
            # Second, compute importance scores according to http://arxiv.org/abs/1905.10650
            multihead_outputs = model.bert.get_multihead_outputs()
            for layer, mh_layer_output in enumerate(multihead_outputs):
                dot = torch.einsum("bhli,bhli->bhl", [mh_layer_output.grad, mh_layer_output])
                head_importance[layer] += dot.abs().sum(-1).sum(0).detach()
        # Also store our logits/labels if we want to compute metrics afterwards
        if preds is None:
            preds = logits.detach().cpu().numpy()
            labels = label_ids.detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            labels = np.append(labels, label_ids.detach().cpu().numpy(), axis=0)
        tot_tokens += input_mask.float().detach().sum().data
    # Normalize
    attn_entropy /= tot_tokens
    head_importance /= tot_tokens
    # Layerwise importance normalization
    if not args.dont_normalize_importance_by_layer:
        exponent = 2
        norm_by_layer = torch.pow(torch.pow(head_importance, exponent).sum(-1), 1/exponent)
        head_importance /= norm_by_layer.unsqueeze(-1) + 1e-20
    if not args.dont_normalize_global_importance:
        head_importance = (head_importance - head_importance.min()) / (head_importance.max() - head_importance.min())
    return attn_entropy, head_importance, preds, labels
def run_model():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default='bert-base-cased-finetuned-mrpc', help='pretrained model name or path to local checkpoint')
    parser.add_argument("--task_name", type=str, default='mrpc', help="The name of the task to train.")
    parser.add_argument("--data_dir", type=str, required=True, help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
    parser.add_argument("--output_dir", type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.")
    parser.add_argument("--data_subset", type=int, default=-1, help="If > 0: limit the data to a subset of data_subset instances.")
    parser.add_argument("--overwrite_output_dir", action='store_true', help="Whether to overwrite data in the output directory")
    parser.add_argument("--dont_normalize_importance_by_layer", action='store_true', help="Don't normalize importance scores by layer")
    parser.add_argument("--dont_normalize_global_importance", action='store_true', help="Don't normalize all importance scores between 0 and 1")
    parser.add_argument("--try_masking", action='store_true', help="Whether to try to mask heads until a threshold of accuracy is reached.")
    parser.add_argument("--masking_threshold", default=0.9, type=float, help="masking threshold in terms of metrics "
                        "(stop masking when metric < threshold * original metric value).")
    parser.add_argument("--masking_amount", default=0.1, type=float, help="Amount of heads to mask at each masking step.")
    parser.add_argument("--metric_name", default="acc", type=str, help="Metric to use for head masking.")
    parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after WordPiece tokenization. \n"
                        "Sequences longer than this will be truncated, and sequences shorter \n"
                        "than this will be padded.")
    parser.add_argument("--batch_size", default=1, type=int, help="Batch size.")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--local_rank", type=int, default=-1, help="local_rank for distributed training on gpus")
    parser.add_argument("--no_cuda", action='store_true', help="Whether not to use CUDA when available")
    parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
    parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
    args = parser.parse_args()

    if args.server_ip and args.server_port:
        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
        import ptvsd
        print("Waiting for debugger attach")
        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
        ptvsd.wait_for_attach()
    # Setup devices and distributed training
    if args.local_rank == -1 or args.no_cuda:
        args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        torch.cuda.set_device(args.local_rank)
        args.device = torch.device("cuda", args.local_rank)
        n_gpu = 1
        torch.distributed.init_process_group(backend='nccl')  # Initializes the distributed backend

    # Setup logging
    logging.basicConfig(level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
    logger.info("device: {} n_gpu: {}, distributed: {}".format(args.device, n_gpu, bool(args.local_rank != -1)))

    # Set seeds
    np.random.seed(args.seed)
    torch.random.manual_seed(args.seed)
    if n_gpu > 0:
        torch.cuda.manual_seed(args.seed)

    # Prepare GLUE task
    task_name = args.task_name.lower()
    processor = processors[task_name]()
    label_list = processor.get_labels()
    args.output_mode = output_modes[task_name]
    args.num_labels = len(label_list)

    # Prepare output directory
    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and not args.overwrite_output_dir:
        raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
    if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
        os.makedirs(args.output_dir)

    # Load model & tokenizer
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only one distributed process downloads model & vocab
    tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)
    # Load a model with all BERTology options on:
    #   output_attentions => will output attention weights
    #   keep_multihead_output => will store gradients of attention head outputs for head importance computation
    #   see: http://arxiv.org/abs/1905.10650
    model = BertForSequenceClassification.from_pretrained(args.model_name_or_path,
                                                          num_labels=args.num_labels,
                                                          output_attentions=True,
                                                          keep_multihead_output=True)
    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only one distributed process downloads model & vocab
    model.to(args.device)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
    model.eval()

    # Prepare dataset for the GLUE task
    eval_examples = processor.get_dev_examples(args.data_dir)
    cached_eval_features_file = os.path.join(args.data_dir, 'dev_{0}_{1}_{2}'.format(
        list(filter(None, args.model_name_or_path.split('/'))).pop(), str(args.max_seq_length), str(task_name)))
    try:
        eval_features = torch.load(cached_eval_features_file)
    except Exception:
        eval_features = convert_examples_to_features(eval_examples, label_list, args.max_seq_length, tokenizer, args.output_mode)
        if args.local_rank in [-1, 0]:
            logger.info("Saving eval features to cache file %s", cached_eval_features_file)
            torch.save(eval_features, cached_eval_features_file)
    all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long if args.output_mode == "classification" else torch.float)
    eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
    if args.data_subset > 0:
        eval_data = Subset(eval_data, list(range(min(args.data_subset, len(eval_data)))))
    eval_sampler = SequentialSampler(eval_data) if args.local_rank == -1 else DistributedSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.batch_size)

    # Print/save training arguments
    print(args)
    torch.save(args, os.path.join(args.output_dir, 'run_args.bin'))
    # Compute head entropy and importance score
    attn_entropy, head_importance, _, _ = compute_heads_importance(args, model, eval_dataloader)

    # Print/save matrices
    np.save(os.path.join(args.output_dir, 'attn_entropy.npy'), attn_entropy.detach().cpu().numpy())
    np.save(os.path.join(args.output_dir, 'head_importance.npy'), head_importance.detach().cpu().numpy())
    logger.info("Attention entropies")
    print_2d_tensor(attn_entropy)
    logger.info("Head importance scores")
    print_2d_tensor(head_importance)
    logger.info("Head ranked by importance scores")
    head_ranks = torch.zeros(head_importance.numel(), dtype=torch.long, device=args.device)
    head_ranks[head_importance.view(-1).sort(descending=True)[1]] = torch.arange(head_importance.numel(), device=args.device)
    head_ranks = head_ranks.view_as(head_importance)
    print_2d_tensor(head_ranks)

    # Do masking if we want to
    if args.try_masking and args.masking_threshold > 0.0 and args.masking_threshold < 1.0:
        _, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False)
        preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
        original_score = compute_metrics(task_name, preds, labels)[args.metric_name]
        logger.info("Pruning: original score: %f, threshold: %f", original_score, original_score * args.masking_threshold)

        new_head_mask = torch.ones_like(head_importance)
        num_to_mask = max(1, int(new_head_mask.numel() * args.masking_amount))

        current_score = original_score
        while current_score >= original_score * args.masking_threshold:
            head_mask = new_head_mask.clone()  # save current head mask
            # heads from least important to most - keep only not-masked heads
            head_importance[head_mask == 0.0] = float('Inf')
            current_heads_to_mask = head_importance.view(-1).sort()[1]
            if len(current_heads_to_mask) <= num_to_mask:
                break
            # mask heads
            current_heads_to_mask = current_heads_to_mask[:num_to_mask]
            logger.info("Heads to mask: %s", str(current_heads_to_mask.tolist()))
            new_head_mask = new_head_mask.view(-1)
            new_head_mask[current_heads_to_mask] = 0.0
            new_head_mask = new_head_mask.view_as(head_mask)
            print_2d_tensor(new_head_mask)

            # Compute metric and head importance again
            _, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False, head_mask=new_head_mask)
            preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
            current_score = compute_metrics(task_name, preds, labels)[args.metric_name]
            logger.info("Masking: current score: %f, remaining heads %d (%.1f percent)", current_score, new_head_mask.sum(), new_head_mask.sum() / new_head_mask.numel() * 100)

        logger.info("Final head mask")
        print_2d_tensor(head_mask)
        np.save(os.path.join(args.output_dir, 'head_mask.npy'), head_mask.detach().cpu().numpy())

        # Try pruning and test time speedup
        # Pruning is like masking but we actually remove the masked weights
        before_time = datetime.now()
        _, _, preds, labels = compute_heads_importance(args, model, eval_dataloader,
                                                       compute_entropy=False, compute_importance=False, head_mask=head_mask)
        preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
        score_masking = compute_metrics(task_name, preds, labels)[args.metric_name]
        original_time = datetime.now() - before_time

        original_num_params = sum(p.numel() for p in model.parameters())
        heads_to_prune = dict((layer, (1 - head_mask[layer].long()).nonzero().tolist()) for layer in range(len(head_mask)))
        assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item()
        model.bert.prune_heads(heads_to_prune)
        pruned_num_params = sum(p.numel() for p in model.parameters())

        before_time = datetime.now()
        _, _, preds, labels = compute_heads_importance(args, model, eval_dataloader,
                                                       compute_entropy=False, compute_importance=False, head_mask=None)
        preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
        score_pruning = compute_metrics(task_name, preds, labels)[args.metric_name]
        new_time = datetime.now() - before_time

        logger.info("Pruning: original num of params: %.2e, after pruning %.2e (%.1f percent)", original_num_params, pruned_num_params, pruned_num_params / original_num_params * 100)
        logger.info("Pruning: score with masking: %f score with pruning: %f", score_masking, score_pruning)
        logger.info("Pruning: speed ratio (original timing / new timing): %f percent", original_time / new_time * 100)

if __name__ == '__main__':
    run_model()
================================================
FILE: examples/extract_features.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Extract pre-computed feature vectors from a PyTorch BERT model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import collections
import logging
import json
import re
import torch
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.modeling import BertModel
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)
class InputExample(object):
    """A single input example: a unique id and one or two text segments."""

    def __init__(self, unique_id, text_a, text_b):
        self.unique_id = unique_id
        self.text_a = text_a
        self.text_b = text_b

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, unique_id, tokens, input_ids, input_mask, input_type_ids):
        self.unique_id = unique_id
        self.tokens = tokens
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.input_type_ids = input_type_ids
def convert_examples_to_features(examples, seq_length, tokenizer):
"""Loads a data file into a list of `InputFeature`s."""
features = []
for (ex_index, example) in enumerate(examples):
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > seq_length - 2:
tokens_a = tokens_a[0:(seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
input_type_ids = []
tokens.append("[CLS]")
input_type_ids.append(0)
for token in tokens_a:
tokens.append(token)
input_type_ids.append(0)
tokens.append("[SEP]")
input_type_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
input_type_ids.append(1)
tokens.append("[SEP]")
input_type_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < seq_length:
input_ids.append(0)
input_mask.append(0)
input_type_ids.append(0)
assert len(input_ids) == seq_length
assert len(input_mask) == seq_length
assert len(input_type_ids) == seq_length
if ex_index < 5:
logger.info("*** Example ***")
logger.info("unique_id: %s" % (example.unique_id))
logger.info("tokens: %s" % " ".join([str(x) for x in tokens]))
logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
logger.info(
"input_type_ids: %s" % " ".join([str(x) for x in input_type_ids]))
features.append(
InputFeatures(
unique_id=example.unique_id,
tokens=tokens,
input_ids=input_ids,
input_mask=input_mask,
input_type_ids=input_type_ids))
return features
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
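The heuristic above is easy to check in isolation. The helper below is a hedged, self-contained restatement for illustration (not the module's own `_truncate_seq_pair`), so the behaviour can be verified standalone:

```python
# Self-contained illustration of the truncation heuristic: the longer
# sequence loses tokens one at a time until the pair fits max_length.
def truncate_pair(tokens_a, tokens_b, max_length):
    while len(tokens_a) + len(tokens_b) > max_length:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

a, b = list("abcdef"), list("xyz")   # lengths 6 and 3
truncate_pair(a, b, 7)
assert (len(a), len(b)) == (4, 3)    # only the longer list was trimmed
```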
def read_examples(input_file):
"""Read a list of `InputExample`s from an input file."""
examples = []
unique_id = 0
with open(input_file, "r", encoding='utf-8') as reader:
while True:
line = reader.readline()
if not line:
break
line = line.strip()
text_a = None
text_b = None
m = re.match(r"^(.*) \|\|\| (.*)$", line)
if m is None:
text_a = line
else:
text_a = m.group(1)
text_b = m.group(2)
examples.append(
InputExample(unique_id=unique_id, text_a=text_a, text_b=text_b))
unique_id += 1
return examples
def main():
parser = argparse.ArgumentParser()
## Required parameters
parser.add_argument("--input_file", default=None, type=str, required=True)
parser.add_argument("--output_file", default=None, type=str, required=True)
parser.add_argument("--bert_model", default=None, type=str, required=True,
help="Bert pre-trained model selected in the list: bert-base-uncased, "
"bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")
## Other parameters
parser.add_argument("--do_lower_case", action='store_true', help="Set this flag if you are using an uncased model.")
parser.add_argument("--layers", default="-1,-2,-3,-4", type=str)
parser.add_argument("--max_seq_length", default=128, type=int,
help="The maximum total input sequence length after WordPiece tokenization. Sequences longer "
"than this will be truncated, and sequences shorter than this will be padded.")
parser.add_argument("--batch_size", default=32, type=int, help="Batch size for predictions.")
parser.add_argument("--local_rank",
type=int,
default=-1,
help = "local_rank for distributed training on gpus")
parser.add_argument("--no_cuda",
action='store_true',
help="Do not use CUDA even when it is available")
args = parser.parse_args()
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
n_gpu = torch.cuda.device_count()
else:
device = torch.device("cuda", args.local_rank)
n_gpu = 1
# Initializes the distributed backend which will take care of synchronizing nodes/GPUs
torch.distributed.init_process_group(backend='nccl')
logger.info("device: {} n_gpu: {} distributed training: {}".format(device, n_gpu, bool(args.local_rank != -1)))
layer_indexes = [int(x) for x in args.layers.split(",")]
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
examples = read_examples(args.input_file)
features = convert_examples_to_features(
examples=examples, seq_length=args.max_seq_length, tokenizer=tokenizer)
unique_id_to_feature = {}
for feature in features:
unique_id_to_feature[feature.unique_id] = feature
model = BertModel.from_pretrained(args.bert_model)
model.to(device)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
output_device=args.local_rank)
elif n_gpu > 1:
model = torch.nn.DataParallel(model)
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_example_index)
if args.local_rank == -1:
eval_sampler = SequentialSampler(eval_data)
else:
eval_sampler = DistributedSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.batch_size)
model.eval()
with open(args.output_file, "w", encoding='utf-8') as writer:
for input_ids, input_mask, example_indices in eval_dataloader:
input_ids = input_ids.to(device)
input_mask = input_mask.to(device)
all_encoder_layers, _ = model(input_ids, token_type_ids=None, attention_mask=input_mask)
for b, example_index in enumerate(example_indices):
feature = features[example_index.item()]
unique_id = int(feature.unique_id)
# feature = unique_id_to_feature[unique_id]
output_json = collections.OrderedDict()
output_json["linex_index"] = unique_id
all_out_features = []
for (i, token) in enumerate(feature.tokens):
all_layers = []
for (j, layer_index) in enumerate(layer_indexes):
layer_output = all_encoder_layers[int(layer_index)].detach().cpu().numpy()
layer_output = layer_output[b]
layers = collections.OrderedDict()
layers["index"] = layer_index
layers["values"] = [
round(x.item(), 6) for x in layer_output[i]
]
all_layers.append(layers)
out_features = collections.OrderedDict()
out_features["token"] = token
out_features["layers"] = all_layers
all_out_features.append(out_features)
output_json["features"] = all_out_features
writer.write(json.dumps(output_json) + "\n")
if __name__ == "__main__":
main()
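The script writes one JSON object per input line. A hedged sketch of reading that output back (the key names match what `main()` writes above; the file path is just an example):

```python
# Sketch of reading extract_features.py output (one JSON object per line).
import json

def load_features(path):
    """Return {unique_id: [(token, {layer_index: vector}), ...]}."""
    results = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            tokens = [
                (feat["token"],
                 {layer["index"]: layer["values"] for layer in feat["layers"]})
                for feat in record["features"]
            ]
            results[record["linex_index"]] = tokens
    return results
```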
================================================
FILE: examples/lm_finetuning/README.md
================================================
# BERT Model Finetuning using Masked Language Modeling objective
## Introduction
The three example scripts in this folder can be used to **fine-tune** a pre-trained BERT model using the pretraining objective (combination of masked language modeling and next sentence prediction loss). In general, pretrained models like BERT are first trained with a pretraining objective (masked language modeling and next sentence prediction for BERT) on a large and general natural language corpus. A classifier head is then added on top of the pre-trained architecture and the model is quickly fine-tuned on a target task, while still (hopefully) retaining its general language understanding. This greatly reduces overfitting and yields state-of-the-art results, especially when training data for the target task are limited.
The [ULMFiT paper](https://arxiv.org/abs/1801.06146) took a slightly different approach, however, and added an intermediate step in which the model is fine-tuned on text **from the same domain as the target task and using the pretraining objective** before the final stage in which the classifier head is added and the model is trained on the target task itself. This paper reported significantly improved results from this step, and found that they could get high-quality classifications even with only tiny numbers (<1000) of labelled training examples, as long as they had a lot of unlabelled data from the target domain.
Although this wasn't covered in the original BERT paper, domain-specific fine-tuning of Transformer models has [recently been explored by other authors](https://arxiv.org/pdf/1905.05583.pdf), who report performance improvements as well.
## Input format
The scripts in this folder expect a single file as input, consisting of untokenized text, with one **sentence** per line, and one blank line between documents. The reason for the sentence splitting is that part of BERT's training involves a _next sentence_ objective in which the model must predict whether two sequences of text are contiguous text from the same document or not, and to avoid making the task _too easy_, the split point between the sequences is always at the end of a sentence. The linebreaks in the file are therefore necessary to mark the points where the text can be split.
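As a concrete illustration of the expected format, the snippet below builds a tiny two-document corpus (the filename `my_corpus.txt` is just an example, matching the usage examples further down):

```python
# Illustrative only: build a tiny corpus in the expected input format
# (one sentence per line, one blank line between documents).
documents = [
    ["This is the first sentence of document one.",
     "Here is a second sentence."],
    ["Document two starts here.",
     "It also has two sentences."],
]

with open("my_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join("\n".join(doc) for doc in documents) + "\n")
```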
## Usage
There are two ways to fine-tune a language model using these scripts. The first _quick_ approach is to use [`simple_lm_finetuning.py`](./simple_lm_finetuning.py). This script does everything in one place, but generates training instances that consist of just two sentences. This is quite different from the BERT paper, where (confusingly) the NextSentence task concatenated sentences from each document into two long multi-sentence sequences, which the paper simply referred to as _sentences_. The difference between this simple approach and the original paper's approach matters for long sequence lengths: two sentences will usually be much shorter than the max sequence length, so most of each training example consists of blank padding characters, which wastes a lot of computation and results in a model that isn't really trained on long sequences.
As such, the preferred approach (assuming you have documents containing multiple contiguous sentences from your target domain) is to use [`pregenerate_training_data.py`](./pregenerate_training_data.py) to pre-process your data into training examples following the methodology used for LM training in the original BERT paper and repository. Since there is a significant random component to training data generation for BERT, this script includes an option to generate multiple _epochs_ of pre-processed data, to avoid training on the same random splits each epoch. Generating an epoch of data for each training epoch should result in a better final model, and so we recommend doing so.
You can then train on the pregenerated data using [`finetune_on_pregenerated.py`](./finetune_on_pregenerated.py), and pointing it to the folder created by [`pregenerate_training_data.py`](./pregenerate_training_data.py). Note that you should use the same `bert_model` and case options for both! Also note that `max_seq_len` does not need to be specified for the [`finetune_on_pregenerated.py`](./finetune_on_pregenerated.py) script, as it is inferred from the training examples.
There are various options that can be tweaked, but they are mostly set to the values from the BERT paper/repository and default values should make sense. The most relevant ones are:
- `--max_seq_len`: Controls the length of training examples (in wordpiece tokens) seen by the model. Defaults to 128 but can be set as high as 512. Higher values may yield stronger language models at the cost of slower and more memory-intensive training.
- `--fp16`: Enables fast half-precision training on recent GPUs.
In addition, if memory usage is an issue, especially when training on a single GPU, reducing `--train_batch_size` from the default 32 to a lower number (4-16) can be helpful, or leaving `--train_batch_size` at the default and increasing `--gradient_accumulation_steps` to 2-8. Changing `--gradient_accumulation_steps` may be preferable as alterations to the batch size may require corresponding changes in the learning rate to compensate. There is also a `--reduce_memory` option for both the `pregenerate_training_data.py` and `finetune_on_pregenerated.py` scripts that spills data to disc in shelf objects or numpy memmaps rather than retaining it in memory, which significantly reduces memory usage with little performance impact.
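Why is changing `--gradient_accumulation_steps` usually safe? Because scaling each micro-batch loss by `1/gradient_accumulation_steps` (as `finetune_on_pregenerated.py` does) makes k small batches produce the same update as one large batch under a mean-reduced loss. A minimal, torch-free sketch of that arithmetic:

```python
# Minimal sketch (plain Python, no torch) of the gradient-accumulation
# pattern: accumulating k micro-batches, each scaled by 1/k, matches the
# mean gradient of one batch that is k times larger.
def accumulate(per_example_grads, micro_batch_size, accumulation_steps):
    """Average gradient over k micro-batches, each scaled by 1/k."""
    grad = 0.0
    for step in range(accumulation_steps):
        batch = per_example_grads[step * micro_batch_size:(step + 1) * micro_batch_size]
        micro_batch_grad = sum(batch) / len(batch)     # mean-reduced "loss.backward()"
        grad += micro_batch_grad / accumulation_steps  # loss / gradient_accumulation_steps
    return grad

grads = [0.1, 0.3, 0.2, 0.4, 0.5, 0.1, 0.0, 0.4]
# 4 micro-batches of 2 accumulate to the same update as 1 batch of 8:
assert abs(accumulate(grads, 2, 4) - sum(grads) / len(grads)) < 1e-12
```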
## Examples
### Simple fine-tuning
```
python3 simple_lm_finetuning.py \
--train_corpus my_corpus.txt \
--bert_model bert-base-uncased \
--do_lower_case \
--output_dir finetuned_lm/ \
--do_train
```
### Pregenerating training data
```
python3 pregenerate_training_data.py \
--train_corpus my_corpus.txt \
--bert_model bert-base-uncased \
--do_lower_case \
--output_dir training/ \
--epochs_to_generate 3 \
--max_seq_len 256
```
### Training on pregenerated data
```
python3 finetune_on_pregenerated.py \
--pregenerated_data training/ \
--bert_model bert-base-uncased \
--do_lower_case \
--output_dir finetuned_lm/ \
--epochs 3
```
================================================
FILE: examples/lm_finetuning/finetune_on_pregenerated.py
================================================
from argparse import ArgumentParser
from pathlib import Path
import os
import torch
import logging
import json
import random
import numpy as np
from collections import namedtuple
from tempfile import TemporaryDirectory
from torch.utils.data import DataLoader, Dataset, RandomSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm
from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
from pytorch_pretrained_bert.modeling import BertForPreTraining
from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule
InputFeatures = namedtuple("InputFeatures", "input_ids input_mask segment_ids lm_label_ids is_next")
log_format = '%(asctime)-10s: %(message)s'
logging.basicConfig(level=logging.INFO, format=log_format)
def convert_example_to_features(example, tokenizer, max_seq_length):
tokens = example["tokens"]
segment_ids = example["segment_ids"]
is_random_next = example["is_random_next"]
masked_lm_positions = example["masked_lm_positions"]
masked_lm_labels = example["masked_lm_labels"]
assert len(tokens) == len(segment_ids) <= max_seq_length # The preprocessed data should be already truncated
input_ids = tokenizer.convert_tokens_to_ids(tokens)
masked_label_ids = tokenizer.convert_tokens_to_ids(masked_lm_labels)
# NumPy removed the np.int/np.bool aliases; use np.int32/bool instead
input_array = np.zeros(max_seq_length, dtype=np.int32)
input_array[:len(input_ids)] = input_ids
mask_array = np.zeros(max_seq_length, dtype=bool)
mask_array[:len(input_ids)] = 1
segment_array = np.zeros(max_seq_length, dtype=bool)
segment_array[:len(segment_ids)] = segment_ids
lm_label_array = np.full(max_seq_length, dtype=np.int32, fill_value=-1)
lm_label_array[masked_lm_positions] = masked_label_ids
features = InputFeatures(input_ids=input_array,
input_mask=mask_array,
segment_ids=segment_array,
lm_label_ids=lm_label_array,
is_next=is_random_next)
return features
class PregeneratedDataset(Dataset):
def __init__(self, training_path, epoch, tokenizer, num_data_epochs, reduce_memory=False):
self.vocab = tokenizer.vocab
self.tokenizer = tokenizer
self.epoch = epoch
self.data_epoch = epoch % num_data_epochs
data_file = training_path / f"epoch_{self.data_epoch}.json"
metrics_file = training_path / f"epoch_{self.data_epoch}_metrics.json"
assert data_file.is_file() and metrics_file.is_file()
metrics = json.loads(metrics_file.read_text())
num_samples = metrics['num_training_examples']
seq_len = metrics['max_seq_len']
self.temp_dir = None
self.working_dir = None
if reduce_memory:
self.temp_dir = TemporaryDirectory()
self.working_dir = Path(self.temp_dir.name)
input_ids = np.memmap(filename=self.working_dir/'input_ids.memmap',
mode='w+', dtype=np.int32, shape=(num_samples, seq_len))
input_masks = np.memmap(filename=self.working_dir/'input_masks.memmap',
shape=(num_samples, seq_len), mode='w+', dtype=bool)
segment_ids = np.memmap(filename=self.working_dir/'segment_ids.memmap',
shape=(num_samples, seq_len), mode='w+', dtype=bool)
lm_label_ids = np.memmap(filename=self.working_dir/'lm_label_ids.memmap',
shape=(num_samples, seq_len), mode='w+', dtype=np.int32)
lm_label_ids[:] = -1
is_nexts = np.memmap(filename=self.working_dir/'is_nexts.memmap',
shape=(num_samples,), mode='w+', dtype=bool)
else:
input_ids = np.zeros(shape=(num_samples, seq_len), dtype=np.int32)
input_masks = np.zeros(shape=(num_samples, seq_len), dtype=bool)
segment_ids = np.zeros(shape=(num_samples, seq_len), dtype=bool)
lm_label_ids = np.full(shape=(num_samples, seq_len), dtype=np.int32, fill_value=-1)
is_nexts = np.zeros(shape=(num_samples,), dtype=bool)
logging.info(f"Loading training examples for epoch {epoch}")
with data_file.open() as f:
for i, line in enumerate(tqdm(f, total=num_samples, desc="Training examples")):
line = line.strip()
example = json.loads(line)
features = convert_example_to_features(example, tokenizer, seq_len)
input_ids[i] = features.input_ids
segment_ids[i] = features.segment_ids
input_masks[i] = features.input_mask
lm_label_ids[i] = features.lm_label_ids
is_nexts[i] = features.is_next
assert i == num_samples - 1 # Assert that the sample count metric was true
logging.info("Loading complete!")
self.num_samples = num_samples
self.seq_len = seq_len
self.input_ids = input_ids
self.input_masks = input_masks
self.segment_ids = segment_ids
self.lm_label_ids = lm_label_ids
self.is_nexts = is_nexts
def __len__(self):
return self.num_samples
def __getitem__(self, item):
return (torch.tensor(self.input_ids[item].astype(np.int64)),
torch.tensor(self.input_masks[item].astype(np.int64)),
torch.tensor(self.segment_ids[item].astype(np.int64)),
torch.tensor(self.lm_label_ids[item].astype(np.int64)),
torch.tensor(self.is_nexts[item].astype(np.int64)))
def main():
parser = ArgumentParser()
parser.add_argument('--pregenerated_data', type=Path, required=True)
parser.add_argument('--output_dir', type=Path, required=True)
parser.add_argument("--bert_model", type=str, required=True, help="Bert pre-trained model selected in the list: bert-base-uncased, "
"bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")
parser.add_argument("--do_lower_case", action="store_true")
parser.add_argument("--reduce_memory", action="store_true",
help="Store training data as on-disc memmaps to massively reduce memory usage")
parser.add_argument("--epochs", type=int, default=3, help="Number of epochs to train for")
parser.add_argument("--local_rank",
type=int,
default=-1,
help="local_rank for distributed training on gpus")
parser.add_argument("--no_cuda",
action='store_true',
help="Do not use CUDA even when it is available")
parser.add_argument('--gradient_accumulation_steps',
type=int,
default=1,
help="Number of update steps to accumulate before performing a backward/update pass.")
parser.add_argument("--train_batch_size",
default=32,
type=int,
help="Total batch size for training.")
parser.add_argument('--fp16',
action='store_true',
help="Whether to use 16-bit float precision instead of 32-bit")
parser.add_argument('--loss_scale',
type=float, default=0,
help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
"0 (default value): dynamic loss scaling.\n"
"Positive power of 2: static loss scaling value.\n")
parser.add_argument("--warmup_proportion",
default=0.1,
type=float,
help="Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10%% of training.")
parser.add_argument("--learning_rate",
default=3e-5,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument('--seed',
type=int,
default=42,
help="random seed for initialization")
args = parser.parse_args()
assert args.pregenerated_data.is_dir(), \
"--pregenerated_data should point to the folder of files made by pregenerate_training_data.py!"
samples_per_epoch = []
for i in range(args.epochs):
epoch_file = args.pregenerated_data / f"epoch_{i}.json"
metrics_file = args.pregenerated_data / f"epoch_{i}_metrics.json"
if epoch_file.is_file() and metrics_file.is_file():
metrics = json.loads(metrics_file.read_text())
samples_per_epoch.append(metrics['num_training_examples'])
else:
if i == 0:
exit("No training data was found!")
print(f"Warning! There are fewer epochs of pregenerated data ({i}) than training epochs ({args.epochs}).")
print("This script will loop over the available data, but training diversity may be negatively impacted.")
num_data_epochs = i
break
else:
num_data_epochs = args.epochs
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
n_gpu = torch.cuda.device_count()
else:
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
n_gpu = 1
# Initializes the distributed backend which will take care of synchronizing nodes/GPUs
torch.distributed.init_process_group(backend='nccl')
logging.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
device, n_gpu, bool(args.local_rank != -1), args.fp16))
if args.gradient_accumulation_steps < 1:
raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
args.gradient_accumulation_steps))
args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if n_gpu > 0:
torch.cuda.manual_seed_all(args.seed)
if args.output_dir.is_dir() and list(args.output_dir.iterdir()):
logging.warning(f"Output directory ({args.output_dir}) already exists and is not empty!")
args.output_dir.mkdir(parents=True, exist_ok=True)
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
total_train_examples = 0
for i in range(args.epochs):
# The modulo takes into account the fact that we may loop over limited epochs of data
total_train_examples += samples_per_epoch[i % len(samples_per_epoch)]
num_train_optimization_steps = int(
total_train_examples / args.train_batch_size / args.gradient_accumulation_steps)
if args.local_rank != -1:
num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
# Prepare model
model = BertForPreTraining.from_pretrained(args.bert_model)
if args.fp16:
model.half()
model.to(device)
if args.local_rank != -1:
try:
from apex.parallel import DistributedDataParallel as DDP
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
model = DDP(model)
elif n_gpu > 1:
model = torch.nn.DataParallel(model)
# Prepare optimizer
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
if args.fp16:
try:
from apex.optimizers import FP16_Optimizer
from apex.optimizers import FusedAdam
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
optimizer = FusedAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
bias_correction=False,
max_grad_norm=1.0)
if args.loss_scale == 0:
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
else:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
else:
optimizer = BertAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
global_step = 0
logging.info("***** Running training *****")
logging.info(f" Num examples = {total_train_examples}")
logging.info(" Batch size = %d", args.train_batch_size)
logging.info(" Num steps = %d", num_train_optimization_steps)
model.train()
for epoch in range(args.epochs):
epoch_dataset = PregeneratedDataset(epoch=epoch, training_path=args.pregenerated_data, tokenizer=tokenizer,
num_data_epochs=num_data_epochs, reduce_memory=args.reduce_memory)
if args.local_rank == -1:
train_sampler = RandomSampler(epoch_dataset)
else:
train_sampler = DistributedSampler(epoch_dataset)
train_dataloader = DataLoader(epoch_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
tr_loss = 0
nb_tr_examples, nb_tr_steps = 0, 0
with tqdm(total=len(train_dataloader), desc=f"Epoch {epoch}") as pbar:
for step, batch in enumerate(train_dataloader):
batch = tuple(t.to(device) for t in batch)
input_ids, input_mask, segment_ids, lm_label_ids, is_next = batch
loss = model(input_ids, segment_ids, input_mask, lm_label_ids, is_next)
if n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu.
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
optimizer.backward(loss)
else:
loss.backward()
tr_loss += loss.item()
nb_tr_examples += input_ids.size(0)
nb_tr_steps += 1
pbar.update(1)
mean_loss = tr_loss * args.gradient_accumulation_steps / nb_tr_steps
pbar.set_postfix_str(f"Loss: {mean_loss:.5f}")
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
# modify learning rate with special warm up BERT uses
# if args.fp16 is False, BertAdam is used that handles this automatically
lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
for param_group in optimizer.param_groups:
param_group['lr'] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
# Save a trained model
logging.info("** ** * Saving fine-tuned model ** ** * ")
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself
output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(args.output_dir)
if __name__ == '__main__':
main()
================================================
FILE: examples/lm_finetuning/pregenerate_training_data.py
================================================
from argparse import ArgumentParser
from pathlib import Path
from tqdm import tqdm, trange
from tempfile import TemporaryDirectory
import shelve
from multiprocessing import Pool
from random import random, randrange, randint, shuffle, choice
from pytorch_pretrained_bert.tokenization import BertTokenizer
import numpy as np
import json
import collections
class DocumentDatabase:
def __init__(self, reduce_memory=False):
if reduce_memory:
self.temp_dir = TemporaryDirectory()
self.working_dir = Path(self.temp_dir.name)
self.document_shelf_filepath = self.working_dir / 'shelf.db'
self.document_shelf = shelve.open(str(self.document_shelf_filepath),
flag='n', protocol=-1)
self.documents = None
else:
self.documents = []
self.document_shelf = None
self.document_shelf_filepath = None
self.temp_dir = None
self.doc_lengths = []
self.doc_cumsum = None
self.cumsum_max = None
self.reduce_memory = reduce_memory
def add_document(self, document):
if not document:
return
if self.reduce_memory:
current_idx = len(self.doc_lengths)
self.document_shelf[str(current_idx)] = document
else:
self.documents.append(document)
self.doc_lengths.append(len(document))
def _precalculate_doc_weights(self):
self.doc_cumsum = np.cumsum(self.doc_lengths)
self.cumsum_max = self.doc_cumsum[-1]
def sample_doc(self, current_idx, sentence_weighted=True):
# Uses the current iteration counter to ensure we don't sample the same doc twice
if sentence_weighted:
# With sentence weighting, we sample docs proportionally to their sentence length
if self.doc_cumsum is None or len(self.doc_cumsum) != len(self.doc_lengths):
self._precalculate_doc_weights()
rand_start = self.doc_cumsum[current_idx]
rand_end = rand_start + self.cumsum_max - self.doc_lengths[current_idx]
sentence_index = randrange(rand_start, rand_end) % self.cumsum_max
sampled_doc_index = np.searchsorted(self.doc_cumsum, sentence_index, side='right')
else:
# If we don't use sentence weighting, then every doc has an equal chance to be chosen
sampled_doc_index = (current_idx + randrange(1, len(self.doc_lengths))) % len(self.doc_lengths)
assert sampled_doc_index != current_idx
if self.reduce_memory:
return self.document_shelf[str(sampled_doc_index)]
else:
return self.documents[sampled_doc_index]
def __len__(self):
return len(self.doc_lengths)
def __getitem__(self, item):
if self.reduce_memory:
return self.document_shelf[str(item)]
else:
return self.documents[item]
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, traceback):
if self.document_shelf is not None:
self.document_shelf.close()
if self.temp_dir is not None:
self.temp_dir.cleanup()
def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens):
"""Truncates a pair of sequences to a maximum sequence length. Lifted from Google's BERT repo."""
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_num_tokens:
break
trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
assert len(trunc_tokens) >= 1
# We want to sometimes truncate from the front and sometimes from the
# back to add more randomness and avoid biases.
if random() < 0.5:
del trunc_tokens[0]
else:
trunc_tokens.pop()
MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
["index", "label"])
def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, whole_word_mask, vocab_list):
"""Creates the predictions for the masked LM objective. This is mostly copied from the Google BERT repo, but
with several refactors to clean it up and remove a lot of unnecessary variables."""
cand_indices = []
for (i, token) in enumerate(tokens):
if token == "[CLS]" or token == "[SEP]":
continue
# Whole Word Masking means that we mask all of the wordpieces
# corresponding to an original word. When a word has been split into
# WordPieces, the first token does not have any marker and any subsequent
# tokens are prefixed with ##. So whenever we see the ## prefix, we
# append the token to the previous set of word indexes.
#
# Note that Whole Word Masking does *not* change the training code
# at all -- we still predict each WordPiece independently, softmaxed
# over the entire vocabulary.
if (whole_word_mask and len(cand_indices) >= 1 and token.startswith("##")):
cand_indices[-1].append(i)
else:
cand_indices.append([i])
num_to_mask = min(max_predictions_per_seq,
max(1, int(round(len(tokens) * masked_lm_prob))))
shuffle(cand_indices)
masked_lms = []
covered_indexes = set()
for index_set in cand_indices:
if len(masked_lms) >= num_to_mask:
break
# If adding a whole-word mask would exceed the maximum number of
# predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_mask:
continue
is_any_index_covered = False
for index in index_set:
if index in covered_indexes:
is_any_index_covered = True
break
if is_any_index_covered:
continue
for index in index_set:
covered_indexes.add(index)
masked_token = None
# 80% of the time, replace with [MASK]
if random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
else:
masked_token = choice(vocab_list)
masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
tokens[index] = masked_token
assert len(masked_lms) <= num_to_mask
masked_lms = sorted(masked_lms, key=lambda x: x.index)
mask_indices = [p.index for p in masked_lms]
masked_token_labels = [p.label for p in masked_lms]
return tokens, mask_indices, masked_token_labels
def create_instances_from_document(
doc_database, doc_idx, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, whole_word_mask, vocab_list):
"""This code is mostly a duplicate of the equivalent function from Google BERT's repo.
However, we make some changes and improvements. Sampling is improved and no longer requires a loop in this function.
Also, documents are sampled proportionally to the number of sentences they contain, which means each sentence
(rather than each document) has an equal chance of being sampled as a false example for the NextSentence task."""
document = doc_database[doc_idx]
# Account for [CLS], [SEP], [SEP]
max_num_tokens = max_seq_length - 3
# We *usually* want to fill up the entire sequence since we are padding
# to `max_seq_length` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `max_seq_length` is a hard limit.
target_seq_length = max_num_tokens
if random() < short_seq_prob:
target_seq_length = randint(2, max_num_tokens)
# We DON'T just concatenate all of the tokens from a document into a long
# sequence and choose an arbitrary split point because this would make the
# next sentence prediction task too easy. Instead, we split the input into
# segments "A" and "B" based on the actual "sentences" provided by the user
# input.
instances = []
current_chunk = []
current_length = 0
i = 0
while i < len(document):
segment = document[i]
current_chunk.append(segment)
current_length += len(segment)
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2:
a_end = randrange(1, len(current_chunk))
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
tokens_b = []
# Random next
if len(current_chunk) == 1 or random() < 0.5:
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# Sample a random document, with longer docs being sampled more frequently
random_document = doc_database.sample_doc(current_idx=doc_idx, sentence_weighted=True)
random_start = randrange(0, len(random_document))
for j in range(random_start, len(random_document)):
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste.
num_unused_segments = len(current_chunk) - a_end
i -= num_unused_segments
# Actual next
else:
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
truncate_seq_pair(tokens_a, tokens_b, max_num_tokens)
assert len(tokens_a) >= 1
assert len(tokens_b) >= 1
tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
# The segment IDs are 0 for the [CLS] token, the A tokens and the first [SEP]
# They are 1 for the B tokens and the final [SEP]
segment_ids = [0 for _ in range(len(tokens_a) + 2)] + [1 for _ in range(len(tokens_b) + 1)]
tokens, masked_lm_positions, masked_lm_labels = create_masked_lm_predictions(
tokens, masked_lm_prob, max_predictions_per_seq, whole_word_mask, vocab_list)
instance = {
"tokens": tokens,
"segment_ids": segment_ids,
"is_random_next": is_random_next,
"masked_lm_positions": masked_lm_positions,
"masked_lm_labels": masked_lm_labels}
instances.append(instance)
current_chunk = []
current_length = 0
i += 1
return instances
def create_training_file(docs, vocab_list, args, epoch_num):
epoch_filename = args.output_dir / "epoch_{}.json".format(epoch_num)
num_instances = 0
with epoch_filename.open('w') as epoch_file:
for doc_idx in trange(len(docs), desc="Document"):
doc_instances = create_instances_from_document(
docs, doc_idx, max_seq_length=args.max_seq_len, short_seq_prob=args.short_seq_prob,
masked_lm_prob=args.masked_lm_prob, max_predictions_per_seq=args.max_predictions_per_seq,
whole_word_mask=args.do_whole_word_mask, vocab_list=vocab_list)
doc_instances = [json.dumps(instance) for instance in doc_instances]
for instance in doc_instances:
epoch_file.write(instance + '\n')
num_instances += 1
metrics_file = args.output_dir / "epoch_{}_metrics.json".format(epoch_num)
with metrics_file.open('w') as metrics_file:
metrics = {
"num_training_examples": num_instances,
"max_seq_len": args.max_seq_len
}
metrics_file.write(json.dumps(metrics))
def main():
parser = ArgumentParser()
parser.add_argument('--train_corpus', type=Path, required=True)
parser.add_argument("--output_dir", type=Path, required=True)
parser.add_argument("--bert_model", type=str, required=True,
choices=["bert-base-uncased", "bert-large-uncased", "bert-base-cased",
"bert-base-multilingual-uncased", "bert-base-chinese", "bert-base-multilingual-cased"])
parser.add_argument("--do_lower_case", action="store_true")
parser.add_argument("--do_whole_word_mask", action="store_true",
help="Whether to use whole word masking rather than per-WordPiece masking.")
parser.add_argument("--reduce_memory", action="store_true",
help="Reduce memory usage for large datasets by keeping data on disc rather than in memory")
parser.add_argument("--num_workers", type=int, default=1,
help="The number of workers to use to write the files")
parser.add_argument("--epochs_to_generate", type=int, default=3,
help="Number of epochs of data to pregenerate")
parser.add_argument("--max_seq_len", type=int, default=128)
parser.add_argument("--short_seq_prob", type=float, default=0.1,
help="Probability of making a short sentence as a training example")
parser.add_argument("--masked_lm_prob", type=float, default=0.15,
help="Probability of masking each token for the LM task")
parser.add_argument("--max_predictions_per_seq", type=int, default=20,
help="Maximum number of tokens to mask in each sequence")
args = parser.parse_args()
if args.num_workers > 1 and args.reduce_memory:
raise ValueError("Cannot use multiple workers while reducing memory")
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
vocab_list = list(tokenizer.vocab.keys())
with DocumentDatabase(reduce_memory=args.reduce_memory) as docs:
with args.train_corpus.open() as f:
doc = []
for line in tqdm(f, desc="Loading Dataset", unit=" lines"):
line = line.strip()
if line == "":
docs.add_document(doc)
doc = []
else:
tokens = tokenizer.tokenize(line)
doc.append(tokens)
if doc:
docs.add_document(doc) # If the last doc didn't end on a newline, make sure it still gets added
if len(docs) <= 1:
exit("ERROR: No document breaks were found in the input file! These are necessary to allow the script to "
"ensure that random NextSentences are not sampled from the same document. Please add blank lines to "
"indicate breaks between documents in your input file. If your dataset does not contain multiple "
"documents, blank lines can be inserted at any natural boundary, such as the ends of chapters, "
"sections or paragraphs.")
args.output_dir.mkdir(exist_ok=True)
if args.num_workers > 1:
writer_workers = Pool(min(args.num_workers, args.epochs_to_generate))
arguments = [(docs, vocab_list, args, idx) for idx in range(args.epochs_to_generate)]
writer_workers.starmap(create_training_file, arguments)
else:
for epoch in trange(args.epochs_to_generate, desc="Epoch"):
create_training_file(docs, vocab_list, args, epoch)
if __name__ == '__main__':
main()
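The 80/10/10 masking rule used by `create_masked_lm_predictions` above (and again by `random_word` in `simple_lm_finetuning.py`) can be sketched in isolation. This is a simplified illustration with a hypothetical helper name (`mask_tokens`), omitting whole-word masking and the `max_predictions_per_seq` cap; it is not the script's own API:

```python
import random

def mask_tokens(tokens, vocab, masked_lm_prob=0.15, rng=None):
    """Simplified sketch of BERT's masking rule: each token is chosen as a
    prediction target with probability masked_lm_prob; a chosen token is
    replaced by [MASK] 80% of the time, by a random vocab token 10% of the
    time, and kept unchanged the remaining 10% of the time."""
    rng = rng or random.Random(0)
    out = list(tokens)
    labels = [None] * len(tokens)  # original token at predicted positions
    for i, tok in enumerate(tokens):
        if rng.random() < masked_lm_prob:
            labels[i] = tok              # the model must predict this token
            p = rng.random()
            if p < 0.8:
                out[i] = "[MASK]"        # 80%: replace with [MASK]
            elif p < 0.9:
                out[i] = rng.choice(vocab)  # 10%: replace with a random token
            # else: 10% keep the original token unchanged
    return out, labels
```

Note that, as in the real script, the label at a kept-unchanged position is still the original token: the model predicts every selected position regardless of how its surface form was corrupted.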
================================================
FILE: examples/lm_finetuning/simple_lm_finetuning.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""
from __future__ import absolute_import, division, print_function, unicode_literals
import argparse
import logging
import os
import random
from io import open
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
from pytorch_pretrained_bert.modeling import BertForPreTraining
from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt='%m/%d/%Y %H:%M:%S',
level=logging.INFO)
logger = logging.getLogger(__name__)
class BERTDataset(Dataset):
def __init__(self, corpus_path, tokenizer, seq_len, encoding="utf-8", corpus_lines=None, on_memory=True):
self.vocab = tokenizer.vocab
self.tokenizer = tokenizer
self.seq_len = seq_len
self.on_memory = on_memory
self.corpus_lines = corpus_lines # number of non-empty lines in input corpus
self.corpus_path = corpus_path
self.encoding = encoding
self.current_doc = 0 # to avoid random sentence from same doc
# for loading samples directly from file
self.sample_counter = 0 # used to keep track of full epochs on file
self.line_buffer = None # keep second sentence of a pair in memory and use as first sentence in next pair
# for loading samples in memory
self.current_random_doc = 0
self.num_docs = 0
self.sample_to_doc = [] # map sample index to doc and line
# load samples into memory
if on_memory:
self.all_docs = []
doc = []
self.corpus_lines = 0
with open(corpus_path, "r", encoding=encoding) as f:
for line in tqdm(f, desc="Loading Dataset", total=corpus_lines):
line = line.strip()
if line == "":
self.all_docs.append(doc)
doc = []
#remove last added sample because there won't be a subsequent line anymore in the doc
self.sample_to_doc.pop()
else:
#store as one sample
sample = {"doc_id": len(self.all_docs),
"line": len(doc)}
self.sample_to_doc.append(sample)
doc.append(line)
self.corpus_lines = self.corpus_lines + 1
# if last row in file is not empty
if self.all_docs[-1] != doc:
self.all_docs.append(doc)
self.sample_to_doc.pop()
self.num_docs = len(self.all_docs)
# load samples later lazily from disk
else:
if self.corpus_lines is None:
with open(corpus_path, "r", encoding=encoding) as f:
self.corpus_lines = 0
for line in tqdm(f, desc="Loading Dataset", total=corpus_lines):
if line.strip() == "":
self.num_docs += 1
else:
self.corpus_lines += 1
# if doc does not end with empty line
if line.strip() != "":
self.num_docs += 1
self.file = open(corpus_path, "r", encoding=encoding)
self.random_file = open(corpus_path, "r", encoding=encoding)
def __len__(self):
# last line of doc won't be used, because there's no "nextSentence". Additionally, we start counting at 0.
return self.corpus_lines - self.num_docs - 1
def __getitem__(self, item):
cur_id = self.sample_counter
self.sample_counter += 1
if not self.on_memory:
# after one epoch we start again from beginning of file
if cur_id != 0 and (cur_id % len(self) == 0):
self.file.close()
self.file = open(self.corpus_path, "r", encoding=self.encoding)
t1, t2, is_next_label = self.random_sent(item)
# tokenize
tokens_a = self.tokenizer.tokenize(t1)
tokens_b = self.tokenizer.tokenize(t2)
# combine to one sample
cur_example = InputExample(guid=cur_id, tokens_a=tokens_a, tokens_b=tokens_b, is_next=is_next_label)
# transform sample to features
cur_features = convert_example_to_features(cur_example, self.seq_len, self.tokenizer)
cur_tensors = (torch.tensor(cur_features.input_ids),
torch.tensor(cur_features.input_mask),
torch.tensor(cur_features.segment_ids),
torch.tensor(cur_features.lm_label_ids),
torch.tensor(cur_features.is_next))
return cur_tensors
def random_sent(self, index):
"""
Get one sample from the corpus consisting of two sentences. With 50% probability these are two subsequent
sentences from one doc. With 50% probability the second sentence is a random one from another doc.
:param index: int, index of sample.
:return: (str, str, int), sentence 1, sentence 2, isNextSentence Label
"""
t1, t2 = self.get_corpus_line(index)
if random.random() > 0.5:
label = 0
else:
t2 = self.get_random_line()
label = 1
assert len(t1) > 0
assert len(t2) > 0
return t1, t2, label
def get_corpus_line(self, item):
"""
Get one sample from corpus consisting of a pair of two subsequent lines from the same doc.
:param item: int, index of sample.
:return: (str, str), two subsequent sentences from corpus
"""
t1 = ""
t2 = ""
assert item < self.corpus_lines
if self.on_memory:
sample = self.sample_to_doc[item]
t1 = self.all_docs[sample["doc_id"]][sample["line"]]
t2 = self.all_docs[sample["doc_id"]][sample["line"]+1]
# used later to avoid random nextSentence from same doc
self.current_doc = sample["doc_id"]
return t1, t2
else:
if self.line_buffer is None:
# read first non-empty line of file
while t1 == "" :
t1 = next(self.file).strip()
t2 = next(self.file).strip()
else:
# use t2 from previous iteration as new t1
t1 = self.line_buffer
t2 = next(self.file).strip()
# skip empty rows that are used for separating documents and keep track of current doc id
while t2 == "" or t1 == "":
t1 = next(self.file).strip()
t2 = next(self.file).strip()
self.current_doc = self.current_doc+1
self.line_buffer = t2
assert t1 != ""
assert t2 != ""
return t1, t2
def get_random_line(self):
"""
Get random line from another document for nextSentence task.
:return: str, content of one line
"""
# Similar to original tf repo: This outer loop should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document we're processing.
for _ in range(10):
if self.on_memory:
rand_doc_idx = random.randint(0, len(self.all_docs)-1)
rand_doc = self.all_docs[rand_doc_idx]
line = rand_doc[random.randrange(len(rand_doc))]
else:
rand_index = random.randint(1, self.corpus_lines if self.corpus_lines < 1000 else 1000)
#pick random line
for _ in range(rand_index):
line = self.get_next_line()
#check if our picked random line is really from another doc like we want it to be
if self.current_random_doc != self.current_doc:
break
return line
def get_next_line(self):
""" Gets next line of random_file and starts over when reaching end of file"""
try:
line = next(self.random_file).strip()
#keep track of which document we are currently looking at to later avoid having the same doc as t1
if line == "":
self.current_random_doc = self.current_random_doc + 1
line = next(self.random_file).strip()
except StopIteration:
self.random_file.close()
self.random_file = open(self.corpus_path, "r", encoding=self.encoding)
line = next(self.random_file).strip()
return line
class InputExample(object):
"""A single training/test example for the language model."""
def __init__(self, guid, tokens_a, tokens_b=None, is_next=None, lm_labels=None):
"""Constructs an InputExample.
Args:
guid: Unique id for the example.
tokens_a: list of str. The tokens of the first sequence. For single
sequence tasks, only this sequence must be specified.
tokens_b: (Optional) list of str. The tokens of the second sequence.
Only must be specified for sequence pair tasks.
is_next: (Optional) int. NextSentence label: 0 if tokens_b directly
follows tokens_a in the corpus, 1 if tokens_b was sampled at random.
lm_labels: (Optional) masked-LM labels for the language model task.
"""
self.guid = guid
self.tokens_a = tokens_a
self.tokens_b = tokens_b
self.is_next = is_next # nextSentence
self.lm_labels = lm_labels # masked words for language model
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, is_next, lm_label_ids):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.is_next = is_next
self.lm_label_ids = lm_label_ids
def random_word(tokens, tokenizer):
"""
Mask some random tokens for the language model task, with probabilities as in the original BERT paper.
:param tokens: list of str, tokenized sentence.
:param tokenizer: Tokenizer, object used for tokenization (we need its vocab here)
:return: (list of str, list of int), masked tokens and related labels for LM prediction
"""
output_label = []
for i, token in enumerate(tokens):
prob = random.random()
# mask token with 15% probability
if prob < 0.15:
prob /= 0.15
# 80% randomly change token to mask token
if prob < 0.8:
tokens[i] = "[MASK]"
# 10% randomly change token to random token
elif prob < 0.9:
tokens[i] = random.choice(list(tokenizer.vocab.items()))[0]
# -> rest 10% randomly keep current token
# append current token to output (we will predict these later)
try:
output_label.append(tokenizer.vocab[token])
except KeyError:
# For unknown tokens (should not occur with a WordPiece vocab)
output_label.append(tokenizer.vocab["[UNK]"])
logger.warning("Cannot find token '{}' in vocab. Using [UNK] instead".format(token))
else:
# no masking token (will be ignored by loss function later)
output_label.append(-1)
return tokens, output_label
def convert_example_to_features(example, max_seq_length, tokenizer):
"""
Convert a raw sample (pair of sentences as tokenized strings) into a proper training sample with
IDs, LM labels, input_mask, CLS and SEP tokens etc.
:param example: InputExample, containing sentence input as strings and is_next label
:param max_seq_length: int, maximum length of sequence.
:param tokenizer: Tokenizer
:return: InputFeatures, containing all inputs and labels of one sample as IDs (as used for model training)
"""
tokens_a = example.tokens_a
tokens_b = example.tokens_b
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
tokens_a, t1_label = random_word(tokens_a, tokenizer)
tokens_b, t2_label = random_word(tokens_b, tokenizer)
# concatenate lm labels and account for CLS, SEP, SEP
lm_label_ids = ([-1] + t1_label + [-1] + t2_label + [-1])
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
assert len(tokens_b) > 0
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
lm_label_ids.append(-1)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(lm_label_ids) == max_seq_length
if example.guid < 5:
logger.info("*** Example ***")
logger.info("guid: %s" % (example.guid))
logger.info("tokens: %s" % " ".join(
[str(x) for x in tokens]))
logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
logger.info(
"segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
logger.info("LM label: %s " % (lm_label_ids))
logger.info("Is next sentence label: %s " % (example.is_next))
features = InputFeatures(input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
lm_label_ids=lm_label_ids,
is_next=example.is_next)
return features
def main():
parser = argparse.ArgumentParser()
## Required parameters
parser.add_argument("--train_corpus",
default=None,
type=str,
required=True,
help="The input train corpus.")
parser.add_argument("--bert_model", default=None, type=str, required=True,
help="Bert pre-trained model selected in the list: bert-base-uncased, "
"bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")
parser.add_argument("--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model checkpoints will be written.")
## Other parameters
parser.add_argument("--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after WordPiece tokenization. \n"
"Sequences longer than this will be truncated, and sequences shorter \n"
"than this will be padded.")
parser.add_argument("--do_train",
action='store_true',
help="Whether to run training.")
parser.add_argument("--train_batch_size",
default=32,
type=int,
help="Total batch size for training.")
parser.add_argument("--learning_rate",
default=3e-5,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument("--num_train_epochs",
default=3.0,
type=float,
help="Total number of training epochs to perform.")
parser.add_argument("--warmup_proportion",
default=0.1,
type=float,
help="Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10%% of training.")
parser.add_argument("--no_cuda",
action='store_true',
help="Whether not to use CUDA when available")
parser.add_argument("--on_memory",
action='store_true',
help="Whether to load train samples into memory or use disk")
parser.add_argument("--do_lower_case",
action='store_true',
help="Whether to lower case the input text. True for uncased models, False for cased models.")
parser.add_argument("--local_rank",
type=int,
default=-1,
help="local_rank for distributed training on gpus")
parser.add_argument('--seed',
type=int,
default=42,
help="random seed for initialization")
parser.add_argument('--gradient_accumulation_steps',
type=int,
default=1,
help="Number of updates steps to accumualte before performing a backward/update pass.")
parser.add_argument('--fp16',
action='store_true',
help="Whether to use 16-bit float precision instead of 32-bit")
parser.add_argument('--loss_scale',
type = float, default = 0,
help = "Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
"0 (default value): dynamic loss scaling.\n"
"Positive power of 2: static loss scaling value.\n")
args = parser.parse_args()
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
n_gpu = torch.cuda.device_count()
else:
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
n_gpu = 1
# Initializes the distributed backend which will take care of synchronizing nodes/GPUs
torch.distributed.init_process_group(backend='nccl')
logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
device, n_gpu, bool(args.local_rank != -1), args.fp16))
if args.gradient_accumulation_steps < 1:
raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
args.gradient_accumulation_steps))
args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if n_gpu > 0:
torch.cuda.manual_seed_all(args.seed)
if not args.do_train:
raise ValueError("Training is currently the only implemented execution option. Please set `do_train`.")
if os.path.exists(args.output_dir) and os.listdir(args.output_dir):
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
if not os.path.exists(args.output_dir):
os.makedirs(args.output_dir)
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
#train_examples = None
num_train_optimization_steps = None
if args.do_train:
print("Loading Train Dataset", args.train_corpus)
train_dataset = BERTDataset(args.train_corpus, tokenizer, seq_len=args.max_seq_length,
corpus_lines=None, on_memory=args.on_memory)
num_train_optimization_steps = int(
len(train_dataset) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs
if args.local_rank != -1:
num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
# Prepare model
model = BertForPreTraining.from_pretrained(args.bert_model)
if args.fp16:
model.half()
model.to(device)
if args.local_rank != -1:
try:
from apex.parallel import DistributedDataParallel as DDP
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
model = DDP(model)
elif n_gpu > 1:
model = torch.nn.DataParallel(model)
# Prepare optimizer
if args.do_train:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
if args.fp16:
try:
from apex.optimizers import FP16_Optimizer
from apex.optimizers import FusedAdam
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
optimizer = FusedAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
bias_correction=False,
max_grad_norm=1.0)
if args.loss_scale == 0:
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
else:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
else:
optimizer = BertAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
global_step = 0
if args.do_train:
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Batch size = %d", args.train_batch_size)
logger.info(" Num steps = %d", num_train_optimization_steps)
if args.local_rank == -1:
train_sampler = RandomSampler(train_dataset)
else:
#TODO: check if this works with current data generator from disk that relies on next(file)
# (it doesn't return item back by index)
train_sampler = DistributedSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
model.train()
for _ in trange(int(args.num_train_epochs), desc="Epoch"):
tr_loss = 0
nb_tr_examples, nb_tr_steps = 0, 0
for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
batch = tuple(t.to(device) for t in batch)
input_ids, input_mask, segment_ids, lm_label_ids, is_next = batch
loss = model(input_ids, segment_ids, input_mask, lm_label_ids, is_next)
if n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu.
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
optimizer.backward(loss)
else:
loss.backward()
tr_loss += loss.item()
nb_tr_examples += input_ids.size(0)
nb_tr_steps += 1
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
# modify learning rate with special warm up BERT uses
# if args.fp16 is False, BertAdam is used that handles this automatically
lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
for param_group in optimizer.param_groups:
param_group['lr'] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
# Save a trained model
logger.info("** ** * Saving fine - tuned model ** ** * ")
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self
output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
if args.do_train:
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(args.output_dir)
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def accuracy(out, labels):
outputs = np.argmax(out, axis=1)
return np.sum(outputs == labels)
if __name__ == "__main__":
main()
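The truncation heuristic in `_truncate_seq_pair` above (and `truncate_seq_pair` in `pregenerate_training_data.py`, which additionally randomizes which end it trims) always removes one token at a time from whichever sequence is currently longer. A small self-contained restatement with hypothetical sample lists shows the effect:

```python
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Trim the longer of the two sequences, one token at a time,
    until their combined length fits within max_length (in place)."""
    while len(tokens_a) + len(tokens_b) > max_length:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

tokens_a = ["the", "quick", "brown", "fox", "jumps", "over"]
tokens_b = ["lazy", "dog"]
_truncate_seq_pair(tokens_a, tokens_b, max_length=6)
# Only tokens_a shrinks here, since it stays the longer sequence until
# the pair fits: tokens_a -> ["the", "quick", "brown", "fox"]
print(tokens_a, tokens_b)
```

Trimming only the longer side preserves as much as possible of a short sequence, where each remaining token tends to carry more information.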
================================================
FILE: examples/run_classifier.py
================================================
#coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""
from __future__ import absolute_import, division, print_function
import argparse
import csv
import logging
import os
import random
import sys
sys.path.append('..')
import copy
import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
TensorDataset)
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from torch.nn import CrossEntropyLoss, MSELoss
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import matthews_corrcoef, f1_score, classification_report
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, WEIGHTS_NAME, CONFIG_NAME
from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig
from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule
logger = logging.getLogger(__name__)
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
self.entity_pos = entity_pos
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_id, entity_mask=None, entity_seg_pos=None, entity_span1_pos=None, entity_span2_pos=None):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.entity_mask = entity_mask
self.entity_seg_pos = entity_seg_pos
self.entity_span1_pos = entity_span1_pos
self.entity_span2_pos = entity_span2_pos
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, "r", encoding="utf-8") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
if sys.version_info[0] == 2:
line = list(unicode(cell, 'utf-8') for cell in line)
lines.append(line)
return lines
class MrpcProcessor(DataProcessor):
"""Processor for the MRPC data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
text_a = line[3]
text_b = line[4]
label = line[0]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class SemProcessor(DataProcessor):
"""Processor for the SemEval 2010 Task 8 dataset."""
def get_train_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.jsonl")))
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.jsonl")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test.jsonl")), "dev")
def get_labels(self):
"""See base class."""
return ['Message-Topic(e2,e1)', 'Instrument-Agency(e2,e1)', 'Entity-Origin(e2,e1)', 'Member-Collection(e1,e2)', 'Member-Collection(e2,e1)', 'Other', 'Component-Whole(e1,e2)', 'Product-Producer(e2,e1)', 'Component-Whole(e2,e1)', 'Entity-Destination(e2,e1)', 'Content-Container(e2,e1)', 'Entity-Destination(e1,e2)', 'Instrument-Agency(e1,e2)', 'Cause-Effect(e2,e1)', 'Entity-Origin(e1,e2)', 'Product-Producer(e1,e2)', 'Cause-Effect(e1,e2)', 'Message-Topic(e1,e2)', 'Content-Container(e1,e2)']
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
import json
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
line = json.loads(line[0])
text_a = ' '.join(line['tokens'])
label = line['label']
entity_pos = line['entities']
examples.append(
InputExample(guid=guid, text_a=text_a, label=label, entity_pos = entity_pos))
return examples
class MnliProcessor(DataProcessor):
"""Processor for the MultiNLI data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
"dev_matched")
def get_labels(self):
"""See base class."""
return ["contradiction", "entailment", "neutral"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[8]
text_b = line[9]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class MnliMismatchedProcessor(MnliProcessor):
"""Processor for the MultiNLI Mismatched data set (GLUE version)."""
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")),
"dev_matched")
class ColaProcessor(DataProcessor):
"""Processor for the CoLA data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text_a = line[3]
label = line[1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
class Sst2Processor(DataProcessor):
"""Processor for the SST-2 data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
text_a = line[0]
label = line[1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
class StsbProcessor(DataProcessor):
"""Processor for the STS-B data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return [None]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[7]
text_b = line[8]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class QqpProcessor(DataProcessor):
"""Processor for the QQP data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
try:
text_a = line[3]
text_b = line[4]
label = line[5]
except IndexError:
continue
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class QnliProcessor(DataProcessor):
"""Processor for the QNLI data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")),
"dev_matched")
def get_labels(self):
"""See base class."""
return ["entailment", "not_entailment"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[1]
text_b = line[2]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class RteProcessor(DataProcessor):
"""Processor for the RTE data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["entailment", "not_entailment"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[1]
text_b = line[2]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class WnliProcessor(DataProcessor):
"""Processor for the WNLI data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[1]
text_b = line[2]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer, output_mode):
"""Loads a data file into a list of `InputBatch`s."""
label_map = {label : i for i, label in enumerate(label_list)}
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
logger.info("Writing example %d of %d" % (ex_index, len(examples)))
old_entity_pos = copy.deepcopy(example.entity_pos)
tokens_a, new_entity_pos = tokenizer.tokenize(example.text_a,example.entity_pos)
old_entity0 = ''.join(example.text_a.split()[old_entity_pos[0][0]:old_entity_pos[0][1]])
old_entity1 = ''.join(example.text_a.split()[old_entity_pos[1][0]:old_entity_pos[1][1]])
new_entity0 = ''.join(tokens_a[new_entity_pos[0][0]:new_entity_pos[0][1]])
new_entity1 = ''.join(tokens_a[new_entity_pos[1][0]:new_entity_pos[1][1]])
old_entity0 = old_entity0.lower()
old_entity1 = old_entity1.lower()
if '##' in new_entity0 or '##' in new_entity1:
new_entity0 = new_entity0.replace('#','')
new_entity1 = new_entity1.replace('#','')
try:
assert(old_entity0 == new_entity0)
assert(old_entity1 == new_entity1)
except AssertionError:
# Drop into the debugger to inspect the tokenization/span mismatch.
import pdb; pdb.set_trace()
# Entity marker
tokens_a_ = copy.deepcopy(tokens_a)
new_entity_pos_ = copy.deepcopy(new_entity_pos)
entity1_start, entity1_end = new_entity_pos[0][0], new_entity_pos[0][1]
entity2_start, entity2_end = new_entity_pos[1][0], new_entity_pos[1][1]
tokens_a.insert(entity1_start, '<s1>')
new_entity_pos[0][0] = entity1_start
tokens_a.insert(entity1_end+1, '<e1>')
new_entity_pos[0][1] = entity1_end+1+1
tokens_a.insert(entity2_start+2, '<s2>')
new_entity_pos[1][0] = entity2_start+2
tokens_a.insert(entity2_end+3,'<e2>')
new_entity_pos[1][1] = entity2_end+3+1
if new_entity_pos[1][1] > max_seq_length - 2 - 1:
# The second entity would be cut off by truncation; drop into the debugger.
import pdb; pdb.set_trace()
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
segment_ids = [0] * len(tokens)
if tokens_b:
tokens += tokens_b + ["[SEP]"]
segment_ids += [1] * (len(tokens_b) + 1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
padding = [0] * (max_seq_length - len(input_ids))
input_ids += padding
input_mask += padding
segment_ids += padding
# Used for mention pooling
entity_mask_tag = 1
entity_mask = [0] * len(input_ids)
for entity in new_entity_pos:
start, end = entity[0],entity[1]
for i in range(start, end):
# +1 offset to account for the leading [CLS] token
entity_mask[i+1] = entity_mask_tag
"""
Different position embedding
"""
# Strategy 1
entity1_pos_tag = 1
entity2_pos_tag = 2
entity_seg_pos = [0] * len(input_ids)
entity1_start, entity1_end = new_entity_pos[0][0], new_entity_pos[0][1]
for i in range(entity1_start, entity1_end):
entity_seg_pos[i+1] = entity1_pos_tag
entity2_start, entity2_end = new_entity_pos[1][0], new_entity_pos[1][1]
for i in range(entity2_start, entity2_end):
entity_seg_pos[i+1] = entity2_pos_tag
# Strategy 2
entity_start_pos_tag = 1
entity_seg_pos_ = [0] * len(input_ids)
entity1_start, entity1_end = new_entity_pos[0][0], new_entity_pos[0][1]
entity_seg_pos_[entity1_start+1] = entity_start_pos_tag
entity2_start, entity2_end = new_entity_pos[1][0], new_entity_pos[1][1]
entity_seg_pos_[entity2_start+1] = entity_start_pos_tag
# Strategy 3
entity_span1_pos = [0] * len(input_ids)
entity1_start, entity1_end = new_entity_pos[0][0], new_entity_pos[0][1]
for i in range(len(entity_span1_pos)):
if i < entity1_start:
#entity_span1_pos[i] = np.abs(i - entity1_start)
entity_span1_pos[i] = i - entity1_start
elif entity1_start <= i and i < entity1_end:
entity_span1_pos[i] = 0
elif i >= entity1_end:
entity_span1_pos[i] = i - entity1_end + 1
entity_span2_pos = [0] * len(input_ids)
entity2_start, entity2_end = new_entity_pos[1][0], new_entity_pos[1][1]
for i in range(len(entity_span2_pos)):
if i < entity2_start:
#entity_span2_pos[i] = np.abs(i - entity2_start)
entity_span2_pos[i] = i - entity2_start
elif entity2_start <= i and i < entity2_end:
entity_span2_pos[i] = 0
elif i >= entity2_end:
entity_span2_pos[i] = i - entity2_end + 1
# Avoid negative positions, which nn.Embedding cannot index
#entity_span1_pos = [pos+max_seq_length-1 for pos in entity_span1_pos]
#entity_span2_pos = [pos+max_seq_length-1 for pos in entity_span2_pos]
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(entity_mask) == max_seq_length
assert len(entity_seg_pos) == max_seq_length
assert len(entity_seg_pos_) == max_seq_length
assert len(entity_span1_pos) == max_seq_length
assert len(entity_span2_pos) == max_seq_length
if output_mode == "classification":
label_id = label_map[example.label]
elif output_mode == "regression":
label_id = float(example.label)
else:
raise KeyError(output_mode)
if ex_index < 5:
logger.info("*** Example ***")
logger.info("guid: %s" % (example.guid))
logger.info("tokens: %s" % " ".join(
[str(x) for x in tokens]))
logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
logger.info("entity_mask: %s" % " ".join([str(x) for x in entity_mask]))
logger.info("entity_seg_pos: %s" % " ".join([str(x) for x in entity_seg_pos]))
logger.info("entity_seg_pos_: %s" % " ".join([str(x) for x in entity_seg_pos_]))
logger.info("entity_span1_pos: %s" % " ".join([str(x) for x in entity_span1_pos]))
logger.info("entity_span2_pos: %s" % " ".join([str(x) for x in entity_span2_pos]))
logger.info(
"segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
logger.info("label: %s (id = %d)" % (example.label, label_id))
#if example.guid == 'train-3':
# import pdb;pdb.set_trace()
features.append(
InputFeatures(input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id,
entity_mask=entity_mask,
entity_seg_pos=entity_seg_pos_,
entity_span1_pos=entity_span1_pos,
entity_span2_pos=entity_span2_pos))
return features
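The entity-marker inserts performed near the top of `convert_examples_to_features` can be sketched as a small standalone helper (`insert_entity_markers` is a hypothetical name; it mirrors the four `insert` calls and their offset bookkeeping, assuming the first span precedes the second and the spans do not overlap):

```python
def insert_entity_markers(tokens, span1, span2):
    # Wrap each entity span with marker tokens. After inserting <s1> and
    # <e1>, every index at or after the second span shifts right by 2,
    # hence the +2 / +3 offsets below.
    tokens = list(tokens)
    s1, e1 = span1  # [start, end) of the first entity
    s2, e2 = span2  # [start, end) of the second entity
    tokens.insert(s1, '<s1>')
    tokens.insert(e1 + 1, '<e1>')
    tokens.insert(s2 + 2, '<s2>')
    tokens.insert(e2 + 3, '<e2>')
    return tokens

print(insert_entity_markers(["bill", "gates", "founded", "microsoft"], (0, 2), (3, 4)))
# -> ['<s1>', 'bill', 'gates', '<e1>', 'founded', '<s2>', 'microsoft', '<e2>']
```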
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
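The relative-position scheme labeled "Strategy 3" in `convert_examples_to_features` above can likewise be sketched standalone (hypothetical helper; positions are negative before the entity span, zero inside it, and positive after it):

```python
def relative_positions(seq_len, start, end):
    # Distance of each token to the entity span [start, end):
    # i - start before the span, 0 inside, i - end + 1 after.
    pos = []
    for i in range(seq_len):
        if i < start:
            pos.append(i - start)
        elif i < end:
            pos.append(0)
        else:
            pos.append(i - end + 1)
    return pos

print(relative_positions(8, 2, 4))  # -> [-2, -1, 0, 0, 1, 2, 3, 4]
```

As the comment at the end of that block notes, these values must be shifted to be non-negative before feeding them to an `nn.Embedding`.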
def simple_accuracy(preds, labels):
return (preds == labels).mean()
def acc_and_f1(preds, labels):
acc = simple_accuracy(preds, labels)
f1 = f1_score(y_true=labels, y_pred=preds,average='micro')
report = classification_report(labels, preds)
return {
"acc": acc,
"f1": f1,
"acc_and_f1": (acc + f1) / 2,
"report": report
}
def pearson_and_spearman(preds, labels):
pearson_corr = pearsonr(preds, labels)[0]
spearman_corr = spearmanr(preds, labels)[0]
return {
"pearson": pearson_corr,
"spearmanr": spearman_corr,
"corr": (pearson_corr + spearman_corr) / 2,
}
def compute_metrics(task_name, preds, labels):
assert len(preds) == len(labels)
if task_name == "cola":
return {"mcc": matthews_corrcoef(labels, preds)}
elif task_name == "sst-2":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "mrpc":
return acc_and_f1(preds, labels)
elif task_name == "sem":
return acc_and_f1(preds, labels)
elif task_name == "sts-b":
return pearson_and_spearman(preds, labels)
elif task_name == "qqp":
return acc_and_f1(preds, labels)
elif task_name == "mnli":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "mnli-mm":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "qnli":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "rte":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "wnli":
return {"acc": simple_accuracy(preds, labels)}
else:
raise KeyError(task_name)
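A quick check of the element-wise comparison behind `simple_accuracy`, which most tasks in `compute_metrics` rely on:

```python
import numpy as np

preds = np.array([0, 1, 1, 0])
labels = np.array([0, 1, 0, 0])
# Same computation as simple_accuracy above: the boolean mask of
# matches is averaged to give the fraction of correct predictions.
acc = (preds == labels).mean()
print(acc)  # -> 0.75
```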
def main():
parser = argparse.ArgumentParser()
## Required parameters
parser.add_argument("--data_dir",
default=None,
type=str,
required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
parser.add_argument("--bert_model", default=None, type=str, required=True,
help="Bert pre-trained model selected in the list: bert-base-uncased, "
"bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, "
"bert-base-multilingual-cased, bert-base-chinese.")
parser.add_argument("--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train.")
parser.add_argument("--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.")
## Other parameters
parser.add_argument("--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3")
parser.add_argument("--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after WordPiece tokenization. \n"
"Sequences longer than this will be truncated, and sequences shorter \n"
"than this will be padded.")
parser.add_argument("--do_train",
action='store_true',
help="Whether to run training.")
parser.add_argument("--do_eval",
action='store_true',
help="Whether to run eval on the dev set.")
parser.add_argument("--do_lower_case",
action='store_true',
help="Set this flag if you are using an uncased model.")
parser.add_argument("--train_batch_size",
default=32,
type=int,
help="Total batch size for training.")
parser.add_argument("--eval_batch_size",
default=8,
type=int,
help="Total batch size for eval.")
parser.add_argument("--learning_rate",
default=5e-5,
type=float,
help="The initial learning rate for Adam.")
parser.add_argument("--num_train_epochs",
default=3.0,
type=float,
help="Total number of training epochs to perform.")
parser.add_argument("--warmup_proportion",
default=0.1,
type=float,
help="Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10%% of training.")
parser.add_argument("--no_cuda",
action='store_true',
help="Whether not to use CUDA when available")
parser.add_argument("--local_rank",
type=int,
default=-1,
help="local_rank for distributed training on gpus")
parser.add_argument('--seed',
type=int,
default=42,
help="random seed for initialization")
parser.add_argument('--gradient_accumulation_steps',
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.")
parser.add_argument('--fp16',
action='store_true',
help="Whether to use 16-bit float precision instead of 32-bit")
parser.add_argument('--loss_scale',
type=float, default=0,
help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
"0 (default value): dynamic loss scaling.\n"
"Positive power of 2: static loss scaling value.\n")
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
args = parser.parse_args()
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
processors = {
"cola": ColaProcessor,
"mnli": MnliProcessor,
"mnli-mm": MnliMismatchedProcessor,
"mrpc": MrpcProcessor,
"sem": SemProcessor,
"sst-2": Sst2Processor,
"sts-b": StsbProcessor,
"qqp": QqpProcessor,
"qnli": QnliProcessor,
"rte": RteProcessor,
"wnli": WnliProcessor,
}
output_modes = {
"cola": "classification",
"mnli": "classification",
"mrpc": "classification",
"sem": "classification",
"sst-2": "classification",
"sts-b": "regression",
"qqp": "classification",
"qnli": "classification",
"rte": "classification",
"wnli": "classification",
}
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
n_gpu = torch.cuda.device_count()
else:
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
n_gpu = 1
# Initializes the distributed backend which will take care of synchronizing nodes/GPUs
torch.distributed.init_process_group(backend='nccl')
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
device, n_gpu, bool(args.local_rank != -1), args.fp16))
if args.gradient_accumulation_steps < 1:
raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
args.gradient_accumulation_steps))
args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if n_gpu > 0:
torch.cuda.manual_seed_all(args.seed)
if not args.do_train and not args.do_eval:
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train:
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
if not os.path.exists(args.output_dir):
os.makedirs(args.output_dir)
task_name = args.task_name.lower()
if task_name not in processors:
raise ValueError("Task not found: %s" % (task_name))
processor = processors[task_name]()
output_mode = output_modes[task_name]
label_list = processor.get_labels()
num_labels = len(label_list)
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
train_examples = None
num_train_optimization_steps = None
if args.do_train:
train_examples = processor.get_train_examples(args.data_dir)
num_train_optimization_steps = int(
len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs
if args.local_rank != -1:
num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
# Prepare model
cache_dir = args.cache_dir if args.cache_dir else os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank))
model = BertForSequenceClassification.from_pretrained(args.bert_model,
cache_dir=cache_dir,
num_labels=num_labels)
if args.fp16:
model.half()
model.to(device)
if args.local_rank != -1:
try:
from apex.parallel import DistributedDataParallel as DDP
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
model = DDP(model)
elif n_gpu > 1:
model = torch.nn.DataParallel(model)
# Prepare optimizer
if args.do_train:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
if args.fp16:
try:
from apex.optimizers import FP16_Optimizer
from apex.optimizers import FusedAdam
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
optimizer = FusedAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
bias_correction=False,
max_grad_norm=1.0)
if args.loss_scale == 0:
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
else:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
else:
optimizer = BertAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
global_step = 0
nb_tr_steps = 0
tr_loss = 0
if args.do_train:
train_features = convert_examples_to_features(
train_examples, label_list, args.max_seq_length, tokenizer, output_mode)
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_examples))
logger.info(" Batch size = %d", args.train_batch_size)
logger.info(" Num steps = %d", num_train_optimization_steps)
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
# Float tensors: consumed as FloatTensor inputs in the model's forward pass
all_entity_mask = torch.tensor([f.entity_mask for f in train_features], dtype=torch.float)
all_entity_seg_pos = torch.tensor([f.entity_seg_pos for f in train_features], dtype=torch.long)
all_entity_span1_pos = torch.tensor([f.entity_span1_pos for f in train_features], dtype=torch.float)
all_entity_span2_pos = torch.tensor([f.entity_span2_pos for f in train_features], dtype=torch.float)
all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
if output_mode == "classification":
all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
elif output_mode == "regression":
all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.float)
train_data = TensorDataset(all_input_ids, all_input_mask, all_entity_mask, all_entity_seg_pos, all_entity_span1_pos, all_entity_span2_pos, all_segment_ids, all_label_ids)
if args.local_rank == -1:
train_sampler = RandomSampler(train_data)
else:
train_sampler = DistributedSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size)
model.train()
for _ in trange(int(args.num_train_epochs), desc="Epoch"):
tr_loss = 0
nb_tr_examples, nb_tr_steps = 0, 0
for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
batch = tuple(t.to(device) for t in batch)
input_ids, input_mask, entity_mask, entity_seg_pos, entity_span1_pos, entity_span2_pos, segment_ids, label_ids = batch
# define a new function to compute loss values for both output_modes
logits = model(input_ids, segment_ids, input_mask, entity_mask, entity_seg_pos, entity_span1_pos, entity_span2_pos, labels=None)
if output_mode == "classification":
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
elif output_mode == "regression":
loss_fct = MSELoss()
loss = loss_fct(logits.view(-1), label_ids.view(-1))
if n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu.
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
optimizer.backward(loss)
else:
loss.backward()
tr_loss += loss.item()
nb_tr_examples += input_ids.size(0)
nb_tr_steps += 1
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
# modify learning rate with special warm up BERT uses
# if args.fp16 is False, BertAdam is used that handles this automatically
lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
for param_group in optimizer.param_groups:
param_group['lr'] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
# Save a trained model, configuration and tokenizer
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself
# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(args.output_dir)
# Load a trained model and vocabulary that you have fine-tuned
model = BertForSequenceClassification.from_pretrained(args.output_dir, num_labels=num_labels)
tokenizer = BertTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
else:
model = BertForSequenceClassification.from_pretrained(args.bert_model, num_labels=num_labels)
model.to(device)
if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
eval_examples = processor.get_dev_examples(args.data_dir)
eval_features = convert_examples_to_features(
eval_examples, label_list, args.max_seq_length, tokenizer, output_mode)
logger.info("***** Running evaluation *****")
logger.info(" Num examples = %d", len(eval_examples))
logger.info(" Batch size = %d", args.eval_batch_size)
all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_entity_mask = torch.tensor([f.entity_mask for f in eval_features], dtype=torch.float)
all_entity_seg_pos = torch.tensor([f.entity_seg_pos for f in eval_features], dtype=torch.long)
all_entity_span1_pos = torch.tensor([f.entity_span1_pos for f in eval_features], dtype=torch.float)
all_entity_span2_pos = torch.tensor([f.entity_span2_pos for f in eval_features], dtype=torch.float)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
if output_mode == "classification":
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
elif output_mode == "regression":
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.float)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_entity_mask, all_entity_seg_pos, all_entity_span1_pos, all_entity_span2_pos, all_segment_ids, all_label_ids)
# Run prediction for full data
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)
model.eval()
eval_loss = 0
nb_eval_steps = 0
preds = []
for input_ids, input_mask, entity_mask, entity_seg_pos, entity_span1_pos, entity_span2_pos, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"):
input_ids = input_ids.to(device)
input_mask = input_mask.to(device)
entity_mask = entity_mask.to(device)
entity_seg_pos = entity_seg_pos.to(device)
entity_span1_pos = entity_span1_pos.to(device)
entity_span2_pos = entity_span2_pos.to(device)
segment_ids = segment_ids.to(device)
label_ids = label_ids.to(device)
with torch.no_grad():
logits = model(input_ids, segment_ids, input_mask, entity_mask, entity_seg_pos, entity_span1_pos, entity_span2_pos, labels=None)
# create eval loss and other metric required by the task
if output_mode == "classification":
loss_fct = CrossEntropyLoss()
tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
elif output_mode == "regression":
loss_fct = MSELoss()
tmp_eval_loss = loss_fct(logits.view(-1), label_ids.view(-1))
eval_loss += tmp_eval_loss.mean().item()
nb_eval_steps += 1
if len(preds) == 0:
preds.append(logits.detach().cpu().numpy())
else:
preds[0] = np.append(
preds[0], logits.detach().cpu().numpy(), axis=0)
eval_loss = eval_loss / nb_eval_steps
preds = preds[0]
if output_mode == "classification":
preds = np.argmax(preds, axis=1)
elif output_mode == "regression":
preds = np.squeeze(preds)
result = compute_metrics(task_name, preds, all_label_ids.numpy())
loss = tr_loss/global_step if args.do_train else None
result['eval_loss'] = eval_loss
result['global_step'] = global_step
result['loss'] = loss
output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key in sorted(result.keys()):
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
# hack for MNLI-MM
if task_name == "mnli":
task_name = "mnli-mm"
processor = processors[task_name]()
if os.path.exists(args.output_dir + '-MM') and os.listdir(args.output_dir + '-MM') and args.do_train:
                raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir + '-MM'))
if not os.path.exists(args.output_dir + '-MM'):
os.makedirs(args.output_dir + '-MM')
eval_examples = processor.get_dev_examples(args.data_dir)
eval_features = convert_examples_to_features(
eval_examples, label_list, args.max_seq_length, tokenizer, output_mode)
logger.info("***** Running evaluation *****")
logger.info(" Num examples = %d", len(eval_examples))
logger.info(" Batch size = %d", args.eval_batch_size)
all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
# Run prediction for full data
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)
model.eval()
eval_loss = 0
nb_eval_steps = 0
preds = []
for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"):
input_ids = input_ids.to(device)
input_mask = input_mask.to(device)
segment_ids = segment_ids.to(device)
label_ids = label_ids.to(device)
with torch.no_grad():
logits = model(input_ids, segment_ids, input_mask, labels=None)
loss_fct = CrossEntropyLoss()
tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
eval_loss += tmp_eval_loss.mean().item()
nb_eval_steps += 1
if len(preds) == 0:
preds.append(logits.detach().cpu().numpy())
else:
preds[0] = np.append(
preds[0], logits.detach().cpu().numpy(), axis=0)
eval_loss = eval_loss / nb_eval_steps
preds = preds[0]
preds = np.argmax(preds, axis=1)
result = compute_metrics(task_name, preds, all_label_ids.numpy())
loss = tr_loss/global_step if args.do_train else None
result['eval_loss'] = eval_loss
result['global_step'] = global_step
result['loss'] = loss
output_eval_file = os.path.join(args.output_dir + '-MM', "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key in sorted(result.keys()):
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if __name__ == "__main__":
main()
================================================
FILE: examples/run_classifier_dataset_utils.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BERT classification fine-tuning: utilities to work with GLUE tasks """
from __future__ import absolute_import, division, print_function
import csv
import logging
import os
import sys
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import matthews_corrcoef, f1_score
logger = logging.getLogger(__name__)
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs an InputExample.
Args:
guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence needs to be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Must be specified only for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_id):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, "r", encoding="utf-8") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
if sys.version_info[0] == 2:
line = list(unicode(cell, 'utf-8') for cell in line)
lines.append(line)
return lines
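A minimal standalone sketch of the tab-separated parsing done by `DataProcessor._read_tsv`, run against an in-memory string instead of a file path (the use of `io.StringIO` here is a demo assumption; the real method opens `input_file` from disk):

```python
import csv
import io

def read_tsv_demo(text, quotechar=None):
    # same csv.reader configuration as _read_tsv: tab delimiter, optional quotechar
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar=quotechar)
    return [line for line in reader]

rows = read_tsv_demo("label\tsentence\n1\tthe dog is hairy .\n")
print(rows)  # [['label', 'sentence'], ['1', 'the dog is hairy .']]
```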
class MrpcProcessor(DataProcessor):
"""Processor for the MRPC data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
text_a = line[3]
text_b = line[4]
label = line[0]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class MnliProcessor(DataProcessor):
"""Processor for the MultiNLI data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
"dev_matched")
def get_labels(self):
"""See base class."""
return ["contradiction", "entailment", "neutral"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[8]
text_b = line[9]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class MnliMismatchedProcessor(MnliProcessor):
"""Processor for the MultiNLI Mismatched data set (GLUE version)."""
def get_dev_examples(self, data_dir):
"""See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")),
            "dev_mismatched")
class ColaProcessor(DataProcessor):
"""Processor for the CoLA data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text_a = line[3]
label = line[1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
class Sst2Processor(DataProcessor):
"""Processor for the SST-2 data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
text_a = line[0]
label = line[1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
class StsbProcessor(DataProcessor):
"""Processor for the STS-B data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return [None]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[7]
text_b = line[8]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class QqpProcessor(DataProcessor):
"""Processor for the QQP data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
try:
text_a = line[3]
text_b = line[4]
label = line[5]
except IndexError:
continue
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class QnliProcessor(DataProcessor):
"""Processor for the QNLI data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")),
"dev_matched")
def get_labels(self):
"""See base class."""
return ["entailment", "not_entailment"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[1]
text_b = line[2]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class RteProcessor(DataProcessor):
"""Processor for the RTE data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["entailment", "not_entailment"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[1]
text_b = line[2]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
class WnliProcessor(DataProcessor):
"""Processor for the WNLI data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[1]
text_b = line[2]
label = line[-1]
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer, output_mode):
    """Loads a data file into a list of `InputFeatures`."""
label_map = {label : i for i, label in enumerate(label_list)}
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
logger.info("Writing example %d of %d" % (ex_index, len(examples)))
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
segment_ids = [0] * len(tokens)
if tokens_b:
tokens += tokens_b + ["[SEP]"]
segment_ids += [1] * (len(tokens_b) + 1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
padding = [0] * (max_seq_length - len(input_ids))
input_ids += padding
input_mask += padding
segment_ids += padding
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
if output_mode == "classification":
label_id = label_map[example.label]
elif output_mode == "regression":
label_id = float(example.label)
else:
raise KeyError(output_mode)
if ex_index < 5:
logger.info("*** Example ***")
logger.info("guid: %s" % (example.guid))
logger.info("tokens: %s" % " ".join(
[str(x) for x in tokens]))
logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
logger.info(
"segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
logger.info("label: %s (id = %d)" % (example.label, label_id))
features.append(
InputFeatures(input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id))
return features
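A toy walk-through of the `[CLS]`/`[SEP]` layout built above, using a hypothetical whitespace split in place of BERT's WordPiece tokenizer (the helper name `build_inputs` is made up for this demo):

```python
def build_inputs(text_a, text_b, max_seq_length):
    # whitespace "tokenization" stands in for tokenizer.tokenize()
    tokens_a = text_a.split()
    tokens_b = text_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment 0 covers [CLS] + tokens_a + first [SEP]; segment 1 covers tokens_b + final [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(tokens)
    # zero-pad the ids up to max_seq_length, as convert_examples_to_features does
    padding = [0] * (max_seq_length - len(tokens))
    return tokens, segment_ids + padding, input_mask + padding

tokens, segment_ids, input_mask = build_inputs("is this jacksonville ?", "no it is not .", 16)
```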
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
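The truncation heuristic above pops tokens only from whichever sequence is currently longer; a quick standalone check of that behaviour (same logic as `_truncate_seq_pair`, repeated here so the snippet is self-contained):

```python
def truncate_pair(tokens_a, tokens_b, max_length):
    # pop one token at a time from the longer sequence until the pair fits
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

a = list("abcdefgh")  # 8 tokens
b = list("xy")        # 2 tokens
truncate_pair(a, b, 6)
print(a, b)  # only the longer sequence `a` shrinks: 4 + 2 == 6
```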
def simple_accuracy(preds, labels):
return (preds == labels).mean()
def acc_and_f1(preds, labels):
acc = simple_accuracy(preds, labels)
f1 = f1_score(y_true=labels, y_pred=preds)
return {
"acc": acc,
"f1": f1,
"acc_and_f1": (acc + f1) / 2,
}
def pearson_and_spearman(preds, labels):
pearson_corr = pearsonr(preds, labels)[0]
spearman_corr = spearmanr(preds, labels)[0]
return {
"pearson": pearson_corr,
"spearmanr": spearman_corr,
"corr": (pearson_corr + spearman_corr) / 2,
}
def compute_metrics(task_name, preds, labels):
assert len(preds) == len(labels)
if task_name == "cola":
return {"mcc": matthews_corrcoef(labels, preds)}
elif task_name == "sst-2":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "mrpc":
return acc_and_f1(preds, labels)
elif task_name == "sts-b":
return pearson_and_spearman(preds, labels)
elif task_name == "qqp":
return acc_and_f1(preds, labels)
elif task_name == "mnli":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "mnli-mm":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "qnli":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "rte":
return {"acc": simple_accuracy(preds, labels)}
elif task_name == "wnli":
return {"acc": simple_accuracy(preds, labels)}
else:
raise KeyError(task_name)
processors = {
"cola": ColaProcessor,
"mnli": MnliProcessor,
"mnli-mm": MnliMismatchedProcessor,
"mrpc": MrpcProcessor,
"sst-2": Sst2Processor,
"sts-b": StsbProcessor,
"qqp": QqpProcessor,
"qnli": QnliProcessor,
"rte": RteProcessor,
"wnli": WnliProcessor,
}
output_modes = {
"cola": "classification",
"mnli": "classification",
"mrpc": "classification",
"sst-2": "classification",
"sts-b": "regression",
"qqp": "classification",
"qnli": "classification",
"rte": "classification",
"wnli": "classification",
}
================================================
FILE: examples/run_gpt2.py
================================================
#!/usr/bin/env python3
import argparse
import logging
from tqdm import trange
import torch
import torch.nn.functional as F
import numpy as np
from pytorch_pretrained_bert import GPT2LMHeadModel, GPT2Tokenizer
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO)
logger = logging.getLogger(__name__)
def top_k_logits(logits, k):
"""
    Masks everything but the k top entries as -infinity (-1e10).
    Used to mask logits so that e^-infinity -> 0 and the masked
    entries do not contribute to the softmax denominator.
"""
if k == 0:
return logits
else:
values = torch.topk(logits, k)[0]
batch_mins = values[:, -1].view(-1, 1).expand_as(logits)
return torch.where(logits < batch_mins, torch.ones_like(logits) * -1e10, logits)
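The same top-k masking idea in plain Python (no torch), as a hedged sketch: every logit below the k-th largest is replaced by a large negative number so it vanishes after softmax, matching the `torch.where` above (ties with the k-th value are kept):

```python
def top_k_mask(logits, k, neg=-1e10):
    if k == 0:
        return list(logits)  # k == 0 means no filtering, as in top_k_logits
    # k-th largest value acts as the cutoff (batch_mins in the torch version)
    kth = sorted(logits, reverse=True)[k - 1]
    return [x if x >= kth else neg for x in logits]

masked = top_k_mask([3.0, 1.0, 2.5, 0.2], k=2)
# logits outside the top 2 collapse to -1e10; the top 2 pass through unchanged
```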
def sample_sequence(model, length, start_token=None, batch_size=None, context=None, temperature=1, top_k=0, device='cuda', sample=True):
if start_token is None:
assert context is not None, 'Specify exactly one of start_token and context!'
context = torch.tensor(context, device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1)
else:
assert context is None, 'Specify exactly one of start_token and context!'
context = torch.full((batch_size, 1), start_token, device=device, dtype=torch.long)
prev = context
output = context
past = None
with torch.no_grad():
for i in trange(length):
logits, past = model(prev, past=past)
logits = logits[:, -1, :] / temperature
logits = top_k_logits(logits, k=top_k)
            probs = F.softmax(logits, dim=-1)
            if sample:
                prev = torch.multinomial(probs, num_samples=1)
            else:
                _, prev = torch.topk(probs, k=1, dim=-1)
output = torch.cat((output, prev), dim=1)
return output
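How the `logits / temperature` step above reshapes the sampling distribution, sketched with a plain-Python softmax (math only, no torch; the function name is made up for this demo):

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    # subtract the max for numerical stability before exponentiating
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
flat = softmax_with_temperature([2.0, 1.0, 0.0], temperature=5.0)
# lower temperature concentrates probability mass on the top logit; higher flattens it
print(sharp[0], flat[0])
```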
def run_model():
parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', type=str, default='gpt2', help='pretrained model name or path to local checkpoint')
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--nsamples", type=int, default=1)
parser.add_argument("--batch_size", type=int, default=-1)
parser.add_argument("--length", type=int, default=-1)
parser.add_argument("--temperature", type=float, default=1.0)
parser.add_argument("--top_k", type=int, default=0)
parser.add_argument('--unconditional', action='store_true', help='If true, unconditional generation.')
args = parser.parse_args()
print(args)
if args.batch_size == -1:
args.batch_size = 1
assert args.nsamples % args.batch_size == 0
np.random.seed(args.seed)
torch.random.manual_seed(args.seed)
torch.cuda.manual_seed(args.seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enc = GPT2Tokenizer.from_pretrained(args.model_name_or_path)
model = GPT2LMHeadModel.from_pretrained(args.model_name_or_path)
model.to(device)
model.eval()
if args.length == -1:
args.length = model.config.n_ctx // 2
elif args.length > model.config.n_ctx:
raise ValueError("Can't get samples longer than window size: %s" % model.config.n_ctx)
while True:
context_tokens = []
if not args.unconditional:
raw_text = input("Model prompt >>> ")
while not raw_text:
print('Prompt should not be empty!')
raw_text = input("Model prompt >>> ")
context_tokens = enc.encode(raw_text)
generated = 0
for _ in range(args.nsamples // args.batch_size):
out = sample_sequence(
model=model, length=args.length,
context=context_tokens,
start_token=None,
batch_size=args.batch_size,
temperature=args.temperature, top_k=args.top_k, device=device
)
out = out[:, len(context_tokens):].tolist()
for i in range(args.batch_size):
generated += 1
text = enc.decode(out[i])
print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
print(text)
print("=" * 80)
else:
generated = 0
for _ in range(args.nsamples // args.batch_size):
out = sample_sequence(
model=model, length=args.length,
context=None,
start_token=enc.encoder['<|endoftext|>'],
batch_size=args.batch_size,
temperature=args.temperature, top_k=args.top_k, device=device
)
out = out[:,1:].tolist()
for i in range(args.batch_size):
generated += 1
text = enc.decode(out[i])
print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
print(text)
print("=" * 80)
if __name__ == '__main__':
run_model()
================================================
FILE: examples/run_openai_gpt.py
================================================
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" OpenAI GPT model fine-tuning script.
Adapted from https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/train.py
Itself adapted from https://github.com/openai/finetune-transformer-lm/blob/master/train.py
This script with default values fine-tunes and evaluates a pretrained OpenAI GPT on the RocStories dataset:
python run_openai_gpt.py \
--model_name openai-gpt \
--do_train \
--do_eval \
--train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
--eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
--output_dir ../log \
--train_batch_size 16 \
"""
import argparse
import os
import csv
import random
import logging
from tqdm import tqdm, trange
import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
TensorDataset)
from pytorch_pretrained_bert import (OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer,
OpenAIAdam, cached_path, WEIGHTS_NAME, CONFIG_NAME)
ROCSTORIES_URL = "https://s3.amazonaws.com/datasets.huggingface.co/ROCStories.tar.gz"
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO)
logger = logging.getLogger(__name__)
def accuracy(out, labels):
    """Returns the number of correct predictions (a sum, not a ratio)."""
    outputs = np.argmax(out, axis=1)
    return np.sum(outputs == labels)
def load_rocstories_dataset(dataset_path):
""" Output a list of tuples(story, 1st continuation, 2nd continuation, label) """
with open(dataset_path, encoding='utf_8') as f:
f = csv.reader(f)
output = []
next(f) # skip the first line
for line in tqdm(f):
output.append((' '.join(line[1:5]), line[5], line[6], int(line[-1])-1))
return output
def pre_process_datasets(encoded_datasets, input_len, cap_length, start_token, delimiter_token, clf_token):
    """ Pre-process datasets containing lists of tuples (story, 1st continuation, 2nd continuation, label)
        into Transformer inputs of shape (n_batch, n_alternatives, length), laid out for each example and continuation as:
        input_ids[batch, alternative, :] = [start_token] + story[:cap_length] + [delimiter_token] + cont1[:cap_length] + [clf_token]
"""
tensor_datasets = []
for dataset in encoded_datasets:
n_batch = len(dataset)
input_ids = np.zeros((n_batch, 2, input_len), dtype=np.int64)
mc_token_ids = np.zeros((n_batch, 2), dtype=np.int64)
lm_labels = np.full((n_batch, 2, input_len), fill_value=-1, dtype=np.int64)
mc_labels = np.zeros((n_batch,), dtype=np.int64)
        for i, (story, cont1, cont2, mc_label) in enumerate(dataset):
with_cont1 = [start_token] + story[:cap_length] + [delimiter_token] + cont1[:cap_length] + [clf_token]
with_cont2 = [start_token] + story[:cap_length] + [delimiter_token] + cont2[:cap_length] + [clf_token]
input_ids[i, 0, :len(with_cont1)] = with_cont1
input_ids[i, 1, :len(with_cont2)] = with_cont2
mc_token_ids[i, 0] = len(with_cont1) - 1
mc_token_ids[i, 1] = len(with_cont2) - 1
lm_labels[i, 0, :len(with_cont1)] = with_cont1
lm_labels[i, 1, :len(with_cont2)] = with_cont2
mc_labels[i] = mc_label
all_inputs = (input_ids, mc_token_ids, lm_labels, mc_labels)
tensor_datasets.append(tuple(torch.tensor(t) for t in all_inputs))
return tensor_datasets
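A tiny concrete instance of the input layout described in the docstring above, with made-up ids for the special tokens (start=1, delimiter=2, clf=3 are demo assumptions, not the real vocabulary ids):

```python
def build_choice(story_ids, cont_ids, cap_length, start=1, delim=2, clf=3):
    # mirrors the with_cont1/with_cont2 construction in pre_process_datasets
    return [start] + story_ids[:cap_length] + [delim] + cont_ids[:cap_length] + [clf]

row = build_choice([10, 11, 12], [20, 21], cap_length=2)
print(row)  # [1, 10, 11, 2, 20, 21, 3]
# mc_token_ids would point at the final [clf] position: len(row) - 1
```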

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name', type=str, default='openai-gpt',
                        help='pretrained model name')
    parser.add_argument("--do_train", action='store_true', help="Whether to run training.")
    parser.add_argument("--do_eval", action='store_true', help="Whether to run eval on the dev set.")
    parser.add_argument("--output_dir", default=None, type=str, required=True,
                        help="The output directory where the model predictions and checkpoints will be written.")
    parser.add_argument('--train_dataset', type=str, default='')
    parser.add_argument('--eval_dataset', type=str, default='')
    parser.add_argument('--seed', type=int, default=42)
    parser.add_argument('--num_train_epochs', type=int, default=3)
    parser.add_argument('--train_batch_size', type=int, default=8)
    parser.add_argument('--eval_batch_size', type=int, default=16)
    parser.add_argument('--max_grad_norm', type=int, default=1)
    parser.add_argument('--learning_rate', type=float, default=6.25e-5)
    parser.add_argument('--warmup_proportion', type=float, default=0.002)
    parser.add_argument('--lr_schedule', type=str, default='warmup_linear')
    parser.add_argument('--weight_decay', type=float, default=0.01)
    parser.add_argument('--lm_coef', type=float, default=0.9)
    parser.add_argument('--n_valid', type=int, default=374)
    parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
    parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
    args = parser.parse_args()
    print(args)

    if args.server_ip and args.server_port:
        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
        import ptvsd
        print("Waiting for debugger attach")
        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
        ptvsd.wait_for_attach()

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed_all(args.seed)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    n_gpu = torch.cuda.device_count()
    logger.info("device: {}, n_gpu {}".format(device, n_gpu))

    if not args.do_train and not args.do_eval:
        raise ValueError("At least one of `do_train` or `do_eval` must be True.")

    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    # Load tokenizer and model
    # These loading functions also add new tokens and embeddings, called `special tokens`
    # These new embeddings will be fine-tuned on the RocStories dataset
    special_tokens = ['_start_', '_delimiter_', '_classify_']
    tokenizer = OpenAIGPTTokenizer.from_pretrained(args.model_name, special_tokens=special_tokens)
    special_tokens_ids = list(tokenizer.convert_tokens_to_ids(token) for token in special_tokens)
    model = OpenAIGPTDoubleHeadsModel.from_pretrained(args.model_name, num_special_tokens=len(special_tokens))
    model.to(device)

    # Load and encode the datasets
    if not args.train_dataset and not args.eval_dataset:
        roc_stories = cached_path(ROCSTORIES_URL)

    def tokenize_and_encode(obj):
        """ Tokenize and encode a nested object """
        if isinstance(obj, str):
            return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
        elif isinstance(obj, int):
            return obj
        return list(tokenize_and_encode(o) for o in obj)

    logger.info("Encoding dataset...")
    train_dataset = load_rocstories_dataset(args.train_dataset)
    eval_dataset = load_rocstories_dataset(args.eval_dataset)
    datasets = (train_dataset, eval_dataset)
    encoded_datasets = tokenize_and_encode(datasets)

    # Compute the max input length for the Transformer
    max_length = model.config.n_positions // 2 - 2
    input_length = max(len(story[:max_length]) + max(len(cont1[:max_length]), len(cont2[:max_length])) + 3
                       for dataset in encoded_datasets for story, cont1, cont2, _ in dataset)
    input_length = min(input_length, model.config.n_positions)  # Max size of input for the pre-trained model

    # Prepare input tensors and dataloaders
    tensor_datasets = pre_process_datasets(encoded_datasets, input_length, max_length, *special_tokens_ids)
    train_tensor_dataset, eval_tensor_dataset = tensor_datasets[0], tensor_datasets[1]

    train_data = TensorDataset(*train_tensor_dataset)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size)

    eval_data = TensorDataset(*eval_tensor_dataset)
    eval_sampler = SequentialSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)

    # Prepare optimizer
    if args.do_train:
        param_optimizer = list(model.named_parameters())
        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
        ]
        num_train_optimization_steps = len(train_dataloader) * args.num_train_epochs
        optimizer = OpenAIAdam(optimizer_grouped_parameters,
                               lr=args.learning_rate,
                               warmup=args.warmup_proportion,
                               max_grad_norm=args.max_grad_norm,
                               weight_decay=args.weight_decay,
                               t_total=num_train_optimization_steps)

    if args.do_train:
        nb_tr_steps, tr_loss, exp_average_loss = 0, 0, None
        model.train()
        for _ in trange(int(args.num_train_epochs), desc="Epoch"):
            tr_loss = 0
            nb_tr_steps = 0
            tqdm_bar = tqdm(train_dataloader, desc="Training")
            for step, batch in enumerate(tqdm_bar):
                batch = tuple(t.to(device) for t in batch)
                input_ids, mc_token_ids, lm_labels, mc_labels = batch
                losses = model(input_ids, mc_token_ids, lm_labels, mc_labels)
                loss = args.lm_coef * losses[0] + losses[1]
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
                tr_loss += loss.item()
                exp_average_loss = loss.item() if exp_average_loss is None else 0.7 * exp_average_loss + 0.3 * loss.item()
                nb_tr_steps += 1
                tqdm_bar.desc = "Training loss: {:.2e} lr: {:.2e}".format(exp_average_loss, optimizer.get_lr()[0])

    # Save a trained model
    if args.do_train:
        # Save a trained model, configuration and tokenizer
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself

        # If we save using the predefined names, we can load using `from_pretrained`
        output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
        output_config_file = os.path.join(args.output_dir, CONFIG_NAME)

        torch.save(model_to_save.state_dict(), output_model_file)
        model_to_save.config.to_json_file(output_config_file)
        tokenizer.save_vocabul
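# After saving under these predefined names, the fine-tuned model and tokenizer
# can be reloaded with `from_pretrained`, e.g. (illustrative sketch):
#   model = OpenAIGPTDoubleHeadsModel.from_pretrained(args.output_dir)
#   tokenizer = OpenAIGPTTokenizer.from_pretrained(args.output_dir)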
gitextract_mjpetdbh/
├── LICENSE
├── MANIFEST.in
├── README.md
├── docker/
│ └── Dockerfile
├── examples/
│ ├── bertology.py
│ ├── extract_features.py
│ ├── lm_finetuning/
│ │ ├── README.md
│ │ ├── finetune_on_pregenerated.py
│ │ ├── pregenerate_training_data.py
│ │ └── simple_lm_finetuning.py
│ ├── run_classifier.py
│ ├── run_classifier_dataset_utils.py
│ ├── run_gpt2.py
│ ├── run_openai_gpt.py
│ ├── run_squad.py
│ ├── run_squad_dataset_utils.py
│ ├── run_swag.py
│ ├── run_transfo_xl.py
│ ├── sem_run_classifier.py
│ ├── tacred_run_classifier.py
│ ├── tacred_run_infer.py
│ ├── test.sh
│ └── train.sh
├── hubconf.py
├── hubconfs/
│ ├── bert_hubconf.py
│ ├── gpt2_hubconf.py
│ ├── gpt_hubconf.py
│ └── transformer_xl_hubconf.py
├── notebooks/
│ ├── Comparing-PT-and-TF-models.ipynb
│ ├── Comparing-TF-and-PT-models-MLM-NSP.ipynb
│ ├── Comparing-TF-and-PT-models-SQuAD.ipynb
│ └── Comparing-TF-and-PT-models.ipynb
├── pytorch_pretrained_bert/
│ ├── __init__.py
│ ├── __main__.py
│ ├── convert_gpt2_checkpoint_to_pytorch.py
│ ├── convert_openai_checkpoint_to_pytorch.py
│ ├── convert_pytorch_checkpoint_to_tf.py
│ ├── convert_tf_checkpoint_to_pytorch.py
│ ├── convert_transfo_xl_checkpoint_to_pytorch.py
│ ├── file_utils.py
│ ├── modeling.py
│ ├── modeling_gpt2.py
│ ├── modeling_openai.py
│ ├── modeling_transfo_xl.py
│ ├── modeling_transfo_xl_utilities.py
│ ├── optimization.py
│ ├── optimization_openai.py
│ ├── tokenization.py
│ ├── tokenization_gpt2.py
│ ├── tokenization_openai.py
│ └── tokenization_transfo_xl.py
├── requirements.txt
├── samples/
│ ├── input.txt
│ └── sample_text.txt
├── setup.py
└── tests/
├── conftest.py
├── modeling_gpt2_test.py
├── modeling_openai_test.py
├── modeling_test.py
├── modeling_transfo_xl_test.py
├── optimization_test.py
├── tokenization_gpt2_test.py
├── tokenization_openai_test.py
├── tokenization_test.py
└── tokenization_transfo_xl_test.py
SYMBOL INDEX (925 symbols across 48 files)
FILE: examples/bertology.py
function entropy (line 23) | def entropy(p):
function print_1d_tensor (line 29) | def print_1d_tensor(tensor, prefix=""):
function print_2d_tensor (line 36) | def print_2d_tensor(tensor):
function compute_heads_importance (line 42) | def compute_heads_importance(args, model, eval_dataloader, compute_entro...
function run_model (line 110) | def run_model():
FILE: examples/extract_features.py
class InputExample (line 40) | class InputExample(object):
method __init__ (line 42) | def __init__(self, unique_id, text_a, text_b):
class InputFeatures (line 48) | class InputFeatures(object):
method __init__ (line 51) | def __init__(self, unique_id, tokens, input_ids, input_mask, input_typ...
function convert_examples_to_features (line 59) | def convert_examples_to_features(examples, seq_length, tokenizer):
function _truncate_seq_pair (line 150) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
function read_examples (line 167) | def read_examples(input_file):
function main (line 191) | def main():
FILE: examples/lm_finetuning/finetune_on_pregenerated.py
function convert_example_to_features (line 27) | def convert_example_to_features(example, tokenizer, max_seq_length):
class PregeneratedDataset (line 58) | class PregeneratedDataset(Dataset):
method __init__ (line 59) | def __init__(self, training_path, epoch, tokenizer, num_data_epochs, r...
method __len__ (line 113) | def __len__(self):
method __getitem__ (line 116) | def __getitem__(self, item):
function main (line 124) | def main():
FILE: examples/lm_finetuning/pregenerate_training_data.py
class DocumentDatabase (line 14) | class DocumentDatabase:
method __init__ (line 15) | def __init__(self, reduce_memory=False):
method add_document (line 33) | def add_document(self, document):
method _precalculate_doc_weights (line 43) | def _precalculate_doc_weights(self):
method sample_doc (line 47) | def sample_doc(self, current_idx, sentence_weighted=True):
method __len__ (line 66) | def __len__(self):
method __getitem__ (line 69) | def __getitem__(self, item):
method __enter__ (line 75) | def __enter__(self):
method __exit__ (line 78) | def __exit__(self, exc_type, exc_val, traceback):
function truncate_seq_pair (line 85) | def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens):
function create_masked_lm_predictions (line 105) | def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions...
function create_instances_from_document (line 170) | def create_instances_from_document(
function create_training_file (line 268) | def create_training_file(docs, vocab_list, args, epoch_num):
function main (line 290) | def main():
FILE: examples/lm_finetuning/simple_lm_finetuning.py
class BERTDataset (line 43) | class BERTDataset(Dataset):
method __init__ (line 44) | def __init__(self, corpus_path, tokenizer, seq_len, encoding="utf-8", ...
method __len__ (line 109) | def __len__(self):
method __getitem__ (line 113) | def __getitem__(self, item):
method random_sent (line 142) | def random_sent(self, index):
method get_corpus_line (line 160) | def get_corpus_line(self, item):
method get_random_line (line 197) | def get_random_line(self):
method get_next_line (line 220) | def get_next_line(self):
class InputExample (line 235) | class InputExample(object):
method __init__ (line 238) | def __init__(self, guid, tokens_a, tokens_b=None, is_next=None, lm_lab...
class InputFeatures (line 257) | class InputFeatures(object):
method __init__ (line 260) | def __init__(self, input_ids, input_mask, segment_ids, is_next, lm_lab...
function random_word (line 268) | def random_word(tokens, tokenizer):
function convert_example_to_features (line 307) | def convert_example_to_features(example, max_seq_length, tokenizer):
function main (line 401) | def main():
function _truncate_seq_pair (line 626) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
function accuracy (line 643) | def accuracy(out, labels):
FILE: examples/run_classifier.py
class InputExample (line 51) | class InputExample(object):
method __init__ (line 54) | def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=N...
class InputFeatures (line 72) | class InputFeatures(object):
method __init__ (line 75) | def __init__(self, input_ids, input_mask, segment_ids, label_id, entit...
class DataProcessor (line 86) | class DataProcessor(object):
method get_train_examples (line 89) | def get_train_examples(self, data_dir):
method get_dev_examples (line 93) | def get_dev_examples(self, data_dir):
method get_labels (line 97) | def get_labels(self):
method _read_tsv (line 102) | def _read_tsv(cls, input_file, quotechar=None):
class MrpcProcessor (line 114) | class MrpcProcessor(DataProcessor):
method get_train_examples (line 117) | def get_train_examples(self, data_dir):
method get_dev_examples (line 123) | def get_dev_examples(self, data_dir):
method get_labels (line 128) | def get_labels(self):
method _create_examples (line 132) | def _create_examples(self, lines, set_type):
class SemProcessor (line 146) | class SemProcessor(DataProcessor):
method get_train_examples (line 149) | def get_train_examples(self, data_dir):
method get_dev_examples (line 155) | def get_dev_examples(self, data_dir):
method get_labels (line 160) | def get_labels(self):
method _create_examples (line 164) | def _create_examples(self, lines, set_type):
class MnliProcessor (line 179) | class MnliProcessor(DataProcessor):
method get_train_examples (line 182) | def get_train_examples(self, data_dir):
method get_dev_examples (line 187) | def get_dev_examples(self, data_dir):
method get_labels (line 193) | def get_labels(self):
method _create_examples (line 197) | def _create_examples(self, lines, set_type):
class MnliMismatchedProcessor (line 212) | class MnliMismatchedProcessor(MnliProcessor):
method get_dev_examples (line 215) | def get_dev_examples(self, data_dir):
class ColaProcessor (line 222) | class ColaProcessor(DataProcessor):
method get_train_examples (line 225) | def get_train_examples(self, data_dir):
method get_dev_examples (line 230) | def get_dev_examples(self, data_dir):
method get_labels (line 235) | def get_labels(self):
method _create_examples (line 239) | def _create_examples(self, lines, set_type):
class Sst2Processor (line 251) | class Sst2Processor(DataProcessor):
method get_train_examples (line 254) | def get_train_examples(self, data_dir):
method get_dev_examples (line 259) | def get_dev_examples(self, data_dir):
method get_labels (line 264) | def get_labels(self):
method _create_examples (line 268) | def _create_examples(self, lines, set_type):
class StsbProcessor (line 282) | class StsbProcessor(DataProcessor):
method get_train_examples (line 285) | def get_train_examples(self, data_dir):
method get_dev_examples (line 290) | def get_dev_examples(self, data_dir):
method get_labels (line 295) | def get_labels(self):
method _create_examples (line 299) | def _create_examples(self, lines, set_type):
class QqpProcessor (line 314) | class QqpProcessor(DataProcessor):
method get_train_examples (line 317) | def get_train_examples(self, data_dir):
method get_dev_examples (line 322) | def get_dev_examples(self, data_dir):
method get_labels (line 327) | def get_labels(self):
method _create_examples (line 331) | def _create_examples(self, lines, set_type):
class QnliProcessor (line 349) | class QnliProcessor(DataProcessor):
method get_train_examples (line 352) | def get_train_examples(self, data_dir):
method get_dev_examples (line 357) | def get_dev_examples(self, data_dir):
method get_labels (line 363) | def get_labels(self):
method _create_examples (line 367) | def _create_examples(self, lines, set_type):
class RteProcessor (line 382) | class RteProcessor(DataProcessor):
method get_train_examples (line 385) | def get_train_examples(self, data_dir):
method get_dev_examples (line 390) | def get_dev_examples(self, data_dir):
method get_labels (line 395) | def get_labels(self):
method _create_examples (line 399) | def _create_examples(self, lines, set_type):
class WnliProcessor (line 414) | class WnliProcessor(DataProcessor):
method get_train_examples (line 417) | def get_train_examples(self, data_dir):
method get_dev_examples (line 422) | def get_dev_examples(self, data_dir):
method get_labels (line 427) | def get_labels(self):
method _create_examples (line 431) | def _create_examples(self, lines, set_type):
function convert_examples_to_features (line 446) | def convert_examples_to_features(examples, label_list, max_seq_length,
function _truncate_seq_pair (line 650) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
function simple_accuracy (line 667) | def simple_accuracy(preds, labels):
function acc_and_f1 (line 671) | def acc_and_f1(preds, labels):
function pearson_and_spearman (line 683) | def pearson_and_spearman(preds, labels):
function compute_metrics (line 693) | def compute_metrics(task_name, preds, labels):
function main (line 721) | def main():
FILE: examples/run_classifier_dataset_utils.py
class InputExample (line 31) | class InputExample(object):
method __init__ (line 34) | def __init__(self, guid, text_a, text_b=None, label=None):
class InputFeatures (line 52) | class InputFeatures(object):
method __init__ (line 55) | def __init__(self, input_ids, input_mask, segment_ids, label_id):
class DataProcessor (line 62) | class DataProcessor(object):
method get_train_examples (line 65) | def get_train_examples(self, data_dir):
method get_dev_examples (line 69) | def get_dev_examples(self, data_dir):
method get_labels (line 73) | def get_labels(self):
method _read_tsv (line 78) | def _read_tsv(cls, input_file, quotechar=None):
class MrpcProcessor (line 90) | class MrpcProcessor(DataProcessor):
method get_train_examples (line 93) | def get_train_examples(self, data_dir):
method get_dev_examples (line 99) | def get_dev_examples(self, data_dir):
method get_labels (line 104) | def get_labels(self):
method _create_examples (line 108) | def _create_examples(self, lines, set_type):
class MnliProcessor (line 123) | class MnliProcessor(DataProcessor):
method get_train_examples (line 126) | def get_train_examples(self, data_dir):
method get_dev_examples (line 131) | def get_dev_examples(self, data_dir):
method get_labels (line 137) | def get_labels(self):
method _create_examples (line 141) | def _create_examples(self, lines, set_type):
class MnliMismatchedProcessor (line 156) | class MnliMismatchedProcessor(MnliProcessor):
method get_dev_examples (line 159) | def get_dev_examples(self, data_dir):
class ColaProcessor (line 166) | class ColaProcessor(DataProcessor):
method get_train_examples (line 169) | def get_train_examples(self, data_dir):
method get_dev_examples (line 174) | def get_dev_examples(self, data_dir):
method get_labels (line 179) | def get_labels(self):
method _create_examples (line 183) | def _create_examples(self, lines, set_type):
class Sst2Processor (line 195) | class Sst2Processor(DataProcessor):
method get_train_examples (line 198) | def get_train_examples(self, data_dir):
method get_dev_examples (line 203) | def get_dev_examples(self, data_dir):
method get_labels (line 208) | def get_labels(self):
method _create_examples (line 212) | def _create_examples(self, lines, set_type):
class StsbProcessor (line 226) | class StsbProcessor(DataProcessor):
method get_train_examples (line 229) | def get_train_examples(self, data_dir):
method get_dev_examples (line 234) | def get_dev_examples(self, data_dir):
method get_labels (line 239) | def get_labels(self):
method _create_examples (line 243) | def _create_examples(self, lines, set_type):
class QqpProcessor (line 258) | class QqpProcessor(DataProcessor):
method get_train_examples (line 261) | def get_train_examples(self, data_dir):
method get_dev_examples (line 266) | def get_dev_examples(self, data_dir):
method get_labels (line 271) | def get_labels(self):
method _create_examples (line 275) | def _create_examples(self, lines, set_type):
class QnliProcessor (line 293) | class QnliProcessor(DataProcessor):
method get_train_examples (line 296) | def get_train_examples(self, data_dir):
method get_dev_examples (line 301) | def get_dev_examples(self, data_dir):
method get_labels (line 307) | def get_labels(self):
method _create_examples (line 311) | def _create_examples(self, lines, set_type):
class RteProcessor (line 326) | class RteProcessor(DataProcessor):
method get_train_examples (line 329) | def get_train_examples(self, data_dir):
method get_dev_examples (line 334) | def get_dev_examples(self, data_dir):
method get_labels (line 339) | def get_labels(self):
method _create_examples (line 343) | def _create_examples(self, lines, set_type):
class WnliProcessor (line 358) | class WnliProcessor(DataProcessor):
method get_train_examples (line 361) | def get_train_examples(self, data_dir):
method get_dev_examples (line 366) | def get_dev_examples(self, data_dir):
method get_labels (line 371) | def get_labels(self):
method _create_examples (line 375) | def _create_examples(self, lines, set_type):
function convert_examples_to_features (line 390) | def convert_examples_to_features(examples, label_list, max_seq_length,
function _truncate_seq_pair (line 482) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
function simple_accuracy (line 499) | def simple_accuracy(preds, labels):
function acc_and_f1 (line 503) | def acc_and_f1(preds, labels):
function pearson_and_spearman (line 513) | def pearson_and_spearman(preds, labels):
function compute_metrics (line 523) | def compute_metrics(task_name, preds, labels):
FILE: examples/run_gpt2.py
function top_k_logits (line 18) | def top_k_logits(logits, k):
function sample_sequence (line 31) | def sample_sequence(model, length, start_token=None, batch_size=None, co...
function run_model (line 54) | def run_model():
FILE: examples/run_openai_gpt.py
function accuracy (line 52) | def accuracy(out, labels):
function load_rocstories_dataset (line 56) | def load_rocstories_dataset(dataset_path):
function pre_process_datasets (line 66) | def pre_process_datasets(encoded_datasets, input_len, cap_length, start_...
function main (line 93) | def main():
FILE: examples/run_squad.py
function main (line 51) | def main():
FILE: examples/run_squad_dataset_utils.py
class SquadExample (line 31) | class SquadExample(object):
method __init__ (line 37) | def __init__(self,
method __str__ (line 53) | def __str__(self):
method __repr__ (line 56) | def __repr__(self):
class InputFeatures (line 71) | class InputFeatures(object):
method __init__ (line 74) | def __init__(self,
function read_squad_examples (line 101) | def read_squad_examples(input_file, is_training, version_2_with_negative):
function convert_examples_to_features (line 179) | def convert_examples_to_features(examples, tokenizer, max_seq_length,
function _improve_answer_span (line 342) | def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
function _check_is_max_context (line 379) | def _check_is_max_context(doc_spans, cur_span_index, position):
function write_predictions (line 420) | def write_predictions(all_examples, all_features, all_results, n_best_size,
function get_final_text (line 612) | def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=...
function _get_best_indexes (line 708) | def _get_best_indexes(logits, n_best_size):
function _compute_softmax (line 720) | def _compute_softmax(scores):
FILE: examples/run_swag.py
class SwagExample (line 46) | class SwagExample(object):
method __init__ (line 48) | def __init__(self,
method __str__ (line 68) | def __str__(self):
method __repr__ (line 71) | def __repr__(self):
class InputFeatures (line 88) | class InputFeatures(object):
method __init__ (line 89) | def __init__(self,
function read_swag_examples (line 107) | def read_swag_examples(input_file, is_training):
function convert_examples_to_features (line 138) | def convert_examples_to_features(examples, tokenizer, max_seq_length,
function _truncate_seq_pair (line 216) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
function accuracy (line 232) | def accuracy(out, labels):
function select_field (line 236) | def select_field(features, field):
function main (line 245) | def main():
FILE: examples/run_transfo_xl.py
function main (line 38) | def main():
FILE: examples/sem_run_classifier.py
class InputExample (line 51) | class InputExample(object):
method __init__ (line 54) | def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=N...
class InputFeatures (line 72) | class InputFeatures(object):
method __init__ (line 75) | def __init__(self, input_ids, input_mask, segment_ids, label_id, entit...
class DataProcessor (line 86) | class DataProcessor(object):
method get_train_examples (line 89) | def get_train_examples(self, data_dir):
method get_dev_examples (line 93) | def get_dev_examples(self, data_dir):
method get_labels (line 97) | def get_labels(self):
method _read_tsv (line 102) | def _read_tsv(cls, input_file, quotechar=None):
class MrpcProcessor (line 114) | class MrpcProcessor(DataProcessor):
method get_train_examples (line 117) | def get_train_examples(self, data_dir):
method get_dev_examples (line 123) | def get_dev_examples(self, data_dir):
method get_labels (line 128) | def get_labels(self):
method _create_examples (line 132) | def _create_examples(self, lines, set_type):
class SemProcessor (line 146) | class SemProcessor(DataProcessor):
method get_train_examples (line 149) | def get_train_examples(self, data_dir):
method get_dev_examples (line 155) | def get_dev_examples(self, data_dir):
method get_labels (line 160) | def get_labels(self):
method _create_examples (line 164) | def _create_examples(self, lines, set_type):
class MnliProcessor (line 179) | class MnliProcessor(DataProcessor):
method get_train_examples (line 182) | def get_train_examples(self, data_dir):
method get_dev_examples (line 187) | def get_dev_examples(self, data_dir):
method get_labels (line 193) | def get_labels(self):
method _create_examples (line 197) | def _create_examples(self, lines, set_type):
class MnliMismatchedProcessor (line 212) | class MnliMismatchedProcessor(MnliProcessor):
method get_dev_examples (line 215) | def get_dev_examples(self, data_dir):
class ColaProcessor (line 222) | class ColaProcessor(DataProcessor):
method get_train_examples (line 225) | def get_train_examples(self, data_dir):
method get_dev_examples (line 230) | def get_dev_examples(self, data_dir):
method get_labels (line 235) | def get_labels(self):
method _create_examples (line 239) | def _create_examples(self, lines, set_type):
class Sst2Processor (line 251) | class Sst2Processor(DataProcessor):
method get_train_examples (line 254) | def get_train_examples(self, data_dir):
method get_dev_examples (line 259) | def get_dev_examples(self, data_dir):
method get_labels (line 264) | def get_labels(self):
method _create_examples (line 268) | def _create_examples(self, lines, set_type):
class StsbProcessor (line 282) | class StsbProcessor(DataProcessor):
method get_train_examples (line 285) | def get_train_examples(self, data_dir):
method get_dev_examples (line 290) | def get_dev_examples(self, data_dir):
method get_labels (line 295) | def get_labels(self):
method _create_examples (line 299) | def _create_examples(self, lines, set_type):
class QqpProcessor (line 314) | class QqpProcessor(DataProcessor):
method get_train_examples (line 317) | def get_train_examples(self, data_dir):
method get_dev_examples (line 322) | def get_dev_examples(self, data_dir):
method get_labels (line 327) | def get_labels(self):
method _create_examples (line 331) | def _create_examples(self, lines, set_type):
class QnliProcessor (line 349) | class QnliProcessor(DataProcessor):
method get_train_examples (line 352) | def get_train_examples(self, data_dir):
method get_dev_examples (line 357) | def get_dev_examples(self, data_dir):
method get_labels (line 363) | def get_labels(self):
method _create_examples (line 367) | def _create_examples(self, lines, set_type):
class RteProcessor (line 382) | class RteProcessor(DataProcessor):
method get_train_examples (line 385) | def get_train_examples(self, data_dir):
method get_dev_examples (line 390) | def get_dev_examples(self, data_dir):
method get_labels (line 395) | def get_labels(self):
method _create_examples (line 399) | def _create_examples(self, lines, set_type):
class WnliProcessor (line 414) | class WnliProcessor(DataProcessor):
method get_train_examples (line 417) | def get_train_examples(self, data_dir):
method get_dev_examples (line 422) | def get_dev_examples(self, data_dir):
method get_labels (line 427) | def get_labels(self):
method _create_examples (line 431) | def _create_examples(self, lines, set_type):
function convert_examples_to_features (line 446) | def convert_examples_to_features(examples, label_list, max_seq_length,
function _truncate_seq_pair (line 650) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
function simple_accuracy (line 667) | def simple_accuracy(preds, labels):
function acc_and_f1 (line 671) | def acc_and_f1(preds, labels):
function pearson_and_spearman (line 683) | def pearson_and_spearman(preds, labels):
function compute_metrics (line 693) | def compute_metrics(task_name, preds, labels):
function main (line 721) | def main():
FILE: examples/tacred_run_classifier.py
class InputExample (line 51) | class InputExample(object):
method __init__ (line 54) | def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=N...
class InputFeatures (line 72) | class InputFeatures(object):
method __init__ (line 75) | def __init__(self, input_ids, input_mask, segment_ids, label_id, entit...
class DataProcessor (line 86) | class DataProcessor(object):
method get_train_examples (line 89) | def get_train_examples(self, data_dir):
method get_dev_examples (line 93) | def get_dev_examples(self, data_dir):
method get_labels (line 97) | def get_labels(self):
method _read_tsv (line 102) | def _read_tsv(cls, input_file, quotechar=None):
class MrpcProcessor (line 114) | class MrpcProcessor(DataProcessor):
method get_train_examples (line 117) | def get_train_examples(self, data_dir):
method get_dev_examples (line 123) | def get_dev_examples(self, data_dir):
method get_labels (line 128) | def get_labels(self):
method _create_examples (line 132) | def _create_examples(self, lines, set_type):
class SemProcessor (line 146) | class SemProcessor(DataProcessor):
method get_train_examples (line 149) | def get_train_examples(self, data_dir):
method get_dev_examples (line 155) | def get_dev_examples(self, data_dir):
method get_labels (line 160) | def get_labels(self):
method _create_examples (line 164) | def _create_examples(self, lines, set_type):
class TacredProcessor (line 177) | class TacredProcessor(DataProcessor):
method get_train_examples (line 180) | def get_train_examples(self, data_dir):
method get_dev_examples (line 186) | def get_dev_examples(self, data_dir):
method get_test_examples (line 191) | def get_test_examples(self, data_dir):
method get_labels (line 196) | def get_labels(self):
method _create_examples (line 199) | def _create_examples(self, lines, set_type):
class MnliProcessor (line 215) | class MnliProcessor(DataProcessor):
method get_train_examples (line 218) | def get_train_examples(self, data_dir):
method get_dev_examples (line 223) | def get_dev_examples(self, data_dir):
method get_labels (line 229) | def get_labels(self):
method _create_examples (line 233) | def _create_examples(self, lines, set_type):
class MnliMismatchedProcessor (line 248) | class MnliMismatchedProcessor(MnliProcessor):
method get_dev_examples (line 251) | def get_dev_examples(self, data_dir):
class ColaProcessor (line 258) | class ColaProcessor(DataProcessor):
method get_train_examples (line 261) | def get_train_examples(self, data_dir):
method get_dev_examples (line 266) | def get_dev_examples(self, data_dir):
method get_labels (line 271) | def get_labels(self):
method _create_examples (line 275) | def _create_examples(self, lines, set_type):
class Sst2Processor (line 287) | class Sst2Processor(DataProcessor):
method get_train_examples (line 290) | def get_train_examples(self, data_dir):
method get_dev_examples (line 295) | def get_dev_examples(self, data_dir):
method get_labels (line 300) | def get_labels(self):
method _create_examples (line 304) | def _create_examples(self, lines, set_type):
class StsbProcessor (line 318) | class StsbProcessor(DataProcessor):
method get_train_examples (line 321) | def get_train_examples(self, data_dir):
method get_dev_examples (line 326) | def get_dev_examples(self, data_dir):
method get_labels (line 331) | def get_labels(self):
method _create_examples (line 335) | def _create_examples(self, lines, set_type):
class QqpProcessor (line 350) | class QqpProcessor(DataProcessor):
method get_train_examples (line 353) | def get_train_examples(self, data_dir):
method get_dev_examples (line 358) | def get_dev_examples(self, data_dir):
method get_labels (line 363) | def get_labels(self):
method _create_examples (line 367) | def _create_examples(self, lines, set_type):
class QnliProcessor (line 385) | class QnliProcessor(DataProcessor):
method get_train_examples (line 388) | def get_train_examples(self, data_dir):
method get_dev_examples (line 393) | def get_dev_examples(self, data_dir):
method get_labels (line 399) | def get_labels(self):
method _create_examples (line 403) | def _create_examples(self, lines, set_type):
class RteProcessor (line 418) | class RteProcessor(DataProcessor):
method get_train_examples (line 421) | def get_train_examples(self, data_dir):
method get_dev_examples (line 426) | def get_dev_examples(self, data_dir):
method get_labels (line 431) | def get_labels(self):
method _create_examples (line 435) | def _create_examples(self, lines, set_type):
class WnliProcessor (line 450) | class WnliProcessor(DataProcessor):
method get_train_examples (line 453) | def get_train_examples(self, data_dir):
method get_dev_examples (line 458) | def get_dev_examples(self, data_dir):
method get_labels (line 463) | def get_labels(self):
method _create_examples (line 467) | def _create_examples(self, lines, set_type):
function convert_examples_to_features (line 481) | def convert_examples_to_features(examples, label_list, max_seq_length,
function _truncate_seq_pair (line 686) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
function simple_accuracy (line 703) | def simple_accuracy(preds, labels):
function acc_and_f1 (line 707) | def acc_and_f1(preds, labels):
function pearson_and_spearman (line 721) | def pearson_and_spearman(preds, labels):
function compute_metrics (line 731) | def compute_metrics(task_name, preds, labels):
function main (line 761) | def main():
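The `_truncate_seq_pair` helper listed above follows the standard BERT preprocessing recipe: pop tokens from whichever sequence is currently longer until the pair fits the budget, so both sequences keep roughly equal context. A minimal self-contained sketch of that convention (names match the listing; the body is the conventional implementation, not copied from this file):

```python
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Trim one token at a time from the longer sequence until the
    # combined length fits; this keeps balanced context from both sides.
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

a = list("abcdefgh")   # 8 tokens
b = list("xyz")        # 3 tokens
_truncate_seq_pair(a, b, 6)
print(len(a), len(b))  # the longer list is trimmed first
```

Note this mutates the token lists in place, which is why `convert_examples_to_features` can call it before adding `[CLS]`/`[SEP]` markers.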
FILE: examples/tacred_run_infer.py
class InputExample (line 39) | class InputExample(object):
method __init__ (line 42) | def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=N...
class InputFeatures (line 60) | class InputFeatures(object):
method __init__ (line 63) | def __init__(self,input_ids, input_mask, segment_ids, label_id, entity...
class DataProcessor (line 74) | class DataProcessor(object):
method get_train_examples (line 77) | def get_train_examples(self, data_dir):
method get_dev_examples (line 81) | def get_dev_examples(self, data_dir):
method get_labels (line 85) | def get_labels(self):
method _read_tsv (line 90) | def _read_tsv(cls, input_file, quotechar=None):
class TacredProcessor (line 101) | class TacredProcessor(DataProcessor):
method get_train_examples (line 104) | def get_train_examples(self, data_dir):
method get_dev_examples (line 110) | def get_dev_examples(self, data_dir):
method get_test_examples (line 115) | def get_test_examples(self, data_dir):
method get_labels (line 120) | def get_labels(self):
method _create_examples (line 123) | def _create_examples(self, lines, set_type):
class _TacredProcessor (line 138) | class _TacredProcessor(DataProcessor):
method get_test_examples (line 141) | def get_test_examples(self, lines):
method get_labels (line 145) | def get_labels(self):
method _create_examples (line 149) | def _create_examples(self, lines, set_type):
function convert_examples_to_features (line 164) | def convert_examples_to_features(examples, label_list, max_seq_length,
function _truncate_seq_pair (line 359) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
function load_model (line 375) | def load_model():
function get_helper_model (line 453) | def get_helper_model(spacy_used=False):
function predict (line 466) | def predict():
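The `DataProcessor` base class re-listed here defines the pattern every task processor above follows: split-specific loaders plus a shared `_read_tsv`. A dependency-free sketch of that interface (the TSV line shown is illustrative, not taken from the TACRED data):

```python
import csv
import tempfile

class DataProcessor(object):
    """Base interface shared by the task processors in the listing."""
    def get_train_examples(self, data_dir):
        raise NotImplementedError()
    def get_dev_examples(self, data_dir):
        raise NotImplementedError()
    def get_labels(self):
        raise NotImplementedError()
    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        # Each line becomes a list of tab-separated fields.
        with open(input_file, "r", encoding="utf-8") as f:
            return list(csv.reader(f, delimiter="\t", quotechar=quotechar))

with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write("no_relation\tHe works at Acme .\n")
    path = f.name
rows = DataProcessor._read_tsv(path)
print(rows)  # [['no_relation', 'He works at Acme .']]
```

Subclasses such as `TacredProcessor` then map each row to an `InputExample(guid, text_a, ...)` inside `_create_examples`.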
FILE: hubconfs/bert_hubconf.py
function _append_from_pretrained_docstring (line 48) | def _append_from_pretrained_docstring(docstr):
function bertTokenizer (line 55) | def bertTokenizer(*args, **kwargs):
function bertModel (line 100) | def bertModel(*args, **kwargs):
function bertForNextSentencePrediction (line 129) | def bertForNextSentencePrediction(*args, **kwargs):
function bertForPreTraining (line 158) | def bertForPreTraining(*args, **kwargs):
function bertForMaskedLM (line 184) | def bertForMaskedLM(*args, **kwargs):
function bertForSequenceClassification (line 217) | def bertForSequenceClassification(*args, **kwargs):
function bertForMultipleChoice (line 257) | def bertForMultipleChoice(*args, **kwargs):
function bertForQuestionAnswering (line 292) | def bertForQuestionAnswering(*args, **kwargs):
function bertForTokenClassification (line 326) | def bertForTokenClassification(*args, **kwargs):
FILE: hubconfs/gpt2_hubconf.py
function _append_from_pretrained_docstring (line 28) | def _append_from_pretrained_docstring(docstr):
function gpt2Tokenizer (line 35) | def gpt2Tokenizer(*args, **kwargs):
function gpt2Model (line 66) | def gpt2Model(*args, **kwargs):
function gpt2LMHeadModel (line 100) | def gpt2LMHeadModel(*args, **kwargs):
function gpt2DoubleHeadsModel (line 138) | def gpt2DoubleHeadsModel(*args, **kwargs):
FILE: hubconfs/gpt_hubconf.py
function _append_from_pretrained_docstring (line 49) | def _append_from_pretrained_docstring(docstr):
function openAIGPTTokenizer (line 56) | def openAIGPTTokenizer(*args, **kwargs):
function openAIGPTModel (line 92) | def openAIGPTModel(*args, **kwargs):
function openAIGPTLMHeadModel (line 122) | def openAIGPTLMHeadModel(*args, **kwargs):
function openAIGPTDoubleHeadsModel (line 156) | def openAIGPTDoubleHeadsModel(*args, **kwargs):
FILE: hubconfs/transformer_xl_hubconf.py
function _append_from_pretrained_docstring (line 31) | def _append_from_pretrained_docstring(docstr):
function transformerXLTokenizer (line 38) | def transformerXLTokenizer(*args, **kwargs):
function transformerXLModel (line 60) | def transformerXLModel(*args, **kwargs):
function transformerXLLMHeadModel (line 94) | def transformerXLLMHeadModel(*args, **kwargs):
FILE: pytorch_pretrained_bert/__main__.py
function main (line 2) | def main():
FILE: pytorch_pretrained_bert/convert_gpt2_checkpoint_to_pytorch.py
function convert_gpt2_checkpoint_to_pytorch (line 30) | def convert_gpt2_checkpoint_to_pytorch(gpt2_checkpoint_path, gpt2_config...
FILE: pytorch_pretrained_bert/convert_openai_checkpoint_to_pytorch.py
function convert_openai_checkpoint_to_pytorch (line 30) | def convert_openai_checkpoint_to_pytorch(openai_checkpoint_folder_path, ...
FILE: pytorch_pretrained_bert/convert_pytorch_checkpoint_to_tf.py
function convert_pytorch_checkpoint_to_tf (line 26) | def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, mode...
function main (line 95) | def main(raw_args=None):
FILE: pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py
function convert_tf_checkpoint_to_pytorch (line 30) | def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_fil...
FILE: pytorch_pretrained_bert/convert_transfo_xl_checkpoint_to_pytorch.py
function convert_transfo_xl_checkpoint_to_pytorch (line 47) | def convert_transfo_xl_checkpoint_to_pytorch(tf_checkpoint_path,
FILE: pytorch_pretrained_bert/file_utils.py
function url_to_filename (line 53) | def url_to_filename(url, etag=None):
function filename_to_url (line 71) | def filename_to_url(filename, cache_dir=None):
function cached_path (line 97) | def cached_path(url_or_filename, cache_dir=None):
function split_s3_path (line 127) | def split_s3_path(url):
function s3_request (line 140) | def s3_request(func):
function s3_etag (line 160) | def s3_etag(url):
function s3_get (line 169) | def s3_get(url, temp_file):
function http_get (line 176) | def http_get(url, temp_file):
function get_from_cache (line 188) | def get_from_cache(url, cache_dir=None):
function read_set_from_file (line 264) | def read_set_from_file(filename):
function get_file_extension (line 276) | def get_file_extension(path, dot=True, lower=True):
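`file_utils.py` builds its download cache around `url_to_filename`: the cache key is a hash of the URL, extended with a hash of the resource's ETag so a changed remote file gets a fresh entry. A sketch of that scheme, assuming the SHA-256 hex digests the library uses:

```python
import hashlib

def url_to_filename(url, etag=None):
    # Cache key: hex SHA-256 of the URL; if an ETag is known, hash it
    # too and append it so updated remote files are re-downloaded.
    filename = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if etag is not None:
        filename += "." + hashlib.sha256(etag.encode("utf-8")).hexdigest()
    return filename

name = url_to_filename("https://example.com/vocab.txt", etag='"abc"')
print(name)
```

`filename_to_url` inverts the mapping via a sidecar metadata file, since a hash alone is not reversible.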
FILE: pytorch_pretrained_bert/modeling.py
function load_tf_weights_in_bert (line 51) | def load_tf_weights_in_bert(model, tf_checkpoint_path):
function gelu (line 118) | def gelu(x):
function swish (line 127) | def swish(x):
class BertConfig (line 134) | class BertConfig(object):
method __init__ (line 137) | def __init__(self,
method from_dict (line 199) | def from_dict(cls, json_object):
method from_json_file (line 207) | def from_json_file(cls, json_file):
method __repr__ (line 213) | def __repr__(self):
method to_dict (line 216) | def to_dict(self):
method to_json_string (line 221) | def to_json_string(self):
method to_json_file (line 225) | def to_json_file(self, json_file_path):
class BertLayerNorm (line 234) | class BertLayerNorm(nn.Module):
method __init__ (line 235) | def __init__(self, hidden_size, eps=1e-12):
method forward (line 243) | def forward(self, x):
class BertEmbeddings (line 249) | class BertEmbeddings(nn.Module):
method __init__ (line 252) | def __init__(self, config):
method forward (line 263) | def forward(self, input_ids, entity_pos_seg=None, entity_span1_pos=Non...
class BertSelfAttention (line 327) | class BertSelfAttention(nn.Module):
method __init__ (line 328) | def __init__(self, config):
method transpose_for_scores (line 344) | def transpose_for_scores(self, x):
method forward (line 349) | def forward(self, hidden_states, attention_mask):
class BertSelfOutput (line 378) | class BertSelfOutput(nn.Module):
method __init__ (line 379) | def __init__(self, config):
method forward (line 385) | def forward(self, hidden_states, input_tensor):
class BertAttention (line 392) | class BertAttention(nn.Module):
method __init__ (line 393) | def __init__(self, config):
method forward (line 398) | def forward(self, input_tensor, attention_mask):
class BertIntermediate (line 404) | class BertIntermediate(nn.Module):
method __init__ (line 405) | def __init__(self, config):
method forward (line 413) | def forward(self, hidden_states):
class BertOutput (line 419) | class BertOutput(nn.Module):
method __init__ (line 420) | def __init__(self, config):
method forward (line 426) | def forward(self, hidden_states, input_tensor):
class BertLayer (line 433) | class BertLayer(nn.Module):
method __init__ (line 434) | def __init__(self, config):
method forward (line 440) | def forward(self, hidden_states, attention_mask):
class BertEncoder (line 447) | class BertEncoder(nn.Module):
method __init__ (line 448) | def __init__(self, config):
method forward (line 453) | def forward(self, hidden_states, attention_mask, output_all_encoded_la...
class BertPooler (line 464) | class BertPooler(nn.Module):
method __init__ (line 465) | def __init__(self, config):
method forward (line 470) | def forward(self, hidden_states):
class BertPredictionHeadTransform (line 479) | class BertPredictionHeadTransform(nn.Module):
method __init__ (line 480) | def __init__(self, config):
method forward (line 489) | def forward(self, hidden_states):
class BertLMPredictionHead (line 496) | class BertLMPredictionHead(nn.Module):
method __init__ (line 497) | def __init__(self, config, bert_model_embedding_weights):
method forward (line 509) | def forward(self, hidden_states):
class BertOnlyMLMHead (line 515) | class BertOnlyMLMHead(nn.Module):
method __init__ (line 516) | def __init__(self, config, bert_model_embedding_weights):
method forward (line 520) | def forward(self, sequence_output):
class BertOnlyNSPHead (line 525) | class BertOnlyNSPHead(nn.Module):
method __init__ (line 526) | def __init__(self, config):
method forward (line 530) | def forward(self, pooled_output):
class BertPreTrainingHeads (line 535) | class BertPreTrainingHeads(nn.Module):
method __init__ (line 536) | def __init__(self, config, bert_model_embedding_weights):
method forward (line 541) | def forward(self, sequence_output, pooled_output):
class BertPreTrainedModel (line 547) | class BertPreTrainedModel(nn.Module):
method __init__ (line 551) | def __init__(self, config, *inputs, **kwargs):
method init_bert_weights (line 562) | def init_bert_weights(self, module):
method from_pretrained (line 576) | def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwa...
class BertModel (line 708) | class BertModel(BertPreTrainedModel):
method __init__ (line 752) | def __init__(self, config):
method forward (line 759) | def forward(self, input_ids, entity_seg_pos = None, entity_span1_pos=N...
class BertForPreTraining (line 795) | class BertForPreTraining(BertPreTrainedModel):
method __init__ (line 845) | def __init__(self, config):
method forward (line 851) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
class BertForMaskedLM (line 866) | class BertForMaskedLM(BertPreTrainedModel):
method __init__ (line 908) | def __init__(self, config):
method forward (line 914) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
class BertForNextSentencePrediction (line 927) | class BertForNextSentencePrediction(BertPreTrainedModel):
method __init__ (line 970) | def __init__(self, config):
method forward (line 976) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
class BertForSequenceClassification (line 989) | class BertForSequenceClassification(BertPreTrainedModel):
method __init__ (line 1034) | def __init__(self, config, num_labels):
method forward (line 1054) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
class BertForMultipleChoice (line 1137) | class BertForMultipleChoice(BertPreTrainedModel):
method __init__ (line 1181) | def __init__(self, config, num_choices):
method forward (line 1189) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
class BertForTokenClassification (line 1206) | class BertForTokenClassification(BertPreTrainedModel):
method __init__ (line 1251) | def __init__(self, config, num_labels):
method forward (line 1259) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
class BertForQuestionAnswering (line 1279) | class BertForQuestionAnswering(BertPreTrainedModel):
method __init__ (line 1326) | def __init__(self, config):
method forward (line 1334) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
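The `gelu` activation listed near the top of `modeling.py` is the exact Gaussian-error-linear-unit, `x * Φ(x)`. The library version operates on tensors with `torch.erf`; a scalar sketch of the same formula using the standard library:

```python
import math

def gelu(x):
    # Exact GELU via the Gaussian CDF: gelu(x) = x * Phi(x).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(0.0))  # 0.0 — GELU passes through the origin
print(gelu(1.0))  # ~0.8413, i.e. 1 * Phi(1)
```

Unlike ReLU, GELU is smooth and weights inputs by how likely they are to be positive under a standard normal, which is why `BertIntermediate` uses it between the two feed-forward projections.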
FILE: pytorch_pretrained_bert/modeling_gpt2.py
function prune_conv1d_layer (line 44) | def prune_conv1d_layer(layer, index, dim=1):
function load_tf_weights_in_gpt2 (line 68) | def load_tf_weights_in_gpt2(model, gpt2_checkpoint_path):
function gelu (line 122) | def gelu(x):
class GPT2Config (line 126) | class GPT2Config(object):
method __init__ (line 130) | def __init__(
method total_tokens_embeddings (line 194) | def total_tokens_embeddings(self):
method from_dict (line 198) | def from_dict(cls, json_object):
method from_json_file (line 206) | def from_json_file(cls, json_file):
method __repr__ (line 212) | def __repr__(self):
method to_dict (line 215) | def to_dict(self):
method to_json_string (line 220) | def to_json_string(self):
method to_json_file (line 224) | def to_json_file(self, json_file_path):
class Conv1D (line 230) | class Conv1D(nn.Module):
method __init__ (line 231) | def __init__(self, nf, nx):
method forward (line 239) | def forward(self, x):
class Attention (line 246) | class Attention(nn.Module):
method __init__ (line 247) | def __init__(self, nx, n_ctx, config, scale=False, output_attentions=F...
method prune_heads (line 266) | def prune_heads(self, heads):
method _attn (line 282) | def _attn(self, q, k, v, head_mask=None):
method merge_heads (line 301) | def merge_heads(self, x):
method split_heads (line 306) | def split_heads(self, x, k=False):
method forward (line 314) | def forward(self, x, layer_past=None, head_mask=None):
class MLP (line 341) | class MLP(nn.Module):
method __init__ (line 342) | def __init__(self, n_state, config): # in MLP: n_state=3072 (4 * n_embd)
method forward (line 350) | def forward(self, x):
class Block (line 356) | class Block(nn.Module):
method __init__ (line 357) | def __init__(self, n_ctx, config, scale=False, output_attentions=False...
method forward (line 366) | def forward(self, x, layer_past=None, head_mask=None):
class GPT2LMHead (line 380) | class GPT2LMHead(nn.Module):
method __init__ (line 383) | def __init__(self, model_embeddings_weights, config):
method set_embeddings_weights (line 392) | def set_embeddings_weights(self, model_embeddings_weights, predict_spe...
method forward (line 396) | def forward(self, hidden_state):
class GPT2MultipleChoiceHead (line 403) | class GPT2MultipleChoiceHead(nn.Module):
method __init__ (line 406) | def __init__(self, config):
method forward (line 415) | def forward(self, hidden_states, mc_token_ids):
class GPT2PreTrainedModel (line 429) | class GPT2PreTrainedModel(nn.Module):
method __init__ (line 434) | def __init__(self, config, *inputs, **kwargs):
method init_weights (line 446) | def init_weights(self, module):
method from_pretrained (line 460) | def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwa...
class GPT2Model (line 607) | class GPT2Model(GPT2PreTrainedModel):
method __init__ (line 668) | def __init__(self, config, output_attentions=False, keep_multihead_out...
method set_num_special_tokens (line 681) | def set_num_special_tokens(self, num_special_tokens):
method prune_heads (line 695) | def prune_heads(self, heads_to_prune):
method get_multihead_outputs (line 702) | def get_multihead_outputs(self):
method forward (line 708) | def forward(self, input_ids, position_ids=None, token_type_ids=None, p...
class GPT2LMHeadModel (line 768) | class GPT2LMHeadModel(GPT2PreTrainedModel):
method __init__ (line 817) | def __init__(self, config, output_attentions=False, keep_multihead_out...
method set_num_special_tokens (line 824) | def set_num_special_tokens(self, num_special_tokens, predict_special_t...
method forward (line 832) | def forward(self, input_ids, position_ids=None, token_type_ids=None, l...
class GPT2DoubleHeadsModel (line 855) | class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
method __init__ (line 909) | def __init__(self, config, output_attentions=False, keep_multihead_out...
method set_num_special_tokens (line 917) | def set_num_special_tokens(self, num_special_tokens, predict_special_t...
method forward (line 925) | def forward(self, input_ids, mc_token_ids, lm_labels=None, mc_labels=N...
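The GPT-2 `Attention` module above enforces causality with a fixed lower-triangular buffer (built with `torch.tril` in the library): position `i` may only attend to positions `j <= i`. A dependency-free sketch of that mask pattern:

```python
def causal_mask(n_ctx):
    # Lower-triangular 0/1 matrix: row i marks the positions
    # that query position i is allowed to attend to.
    return [[1 if j <= i else 0 for j in range(n_ctx)]
            for i in range(n_ctx)]

m = causal_mask(4)
for row in m:
    print(row)
```

In `_attn` the zeros of this mask are used to push the corresponding attention logits to a large negative value before the softmax, so future tokens receive zero weight.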
FILE: pytorch_pretrained_bert/modeling_openai.py
function load_tf_weights_in_openai_gpt (line 44) | def load_tf_weights_in_openai_gpt(model, openai_checkpoint_folder_path):
function gelu (line 114) | def gelu(x):
function swish (line 118) | def swish(x):
class OpenAIGPTConfig (line 125) | class OpenAIGPTConfig(object):
method __init__ (line 129) | def __init__(
method total_tokens_embeddings (line 197) | def total_tokens_embeddings(self):
method from_dict (line 201) | def from_dict(cls, json_object):
method from_json_file (line 209) | def from_json_file(cls, json_file):
method __repr__ (line 215) | def __repr__(self):
method to_dict (line 218) | def to_dict(self):
method to_json_string (line 223) | def to_json_string(self):
method to_json_file (line 227) | def to_json_file(self, json_file_path):
class Conv1D (line 233) | class Conv1D(nn.Module):
method __init__ (line 234) | def __init__(self, nf, rf, nx):
method forward (line 246) | def forward(self, x):
class Attention (line 256) | class Attention(nn.Module):
method __init__ (line 257) | def __init__(self, nx, n_ctx, config, scale=False, output_attentions=F...
method prune_heads (line 276) | def prune_heads(self, heads):
method _attn (line 292) | def _attn(self, q, k, v, head_mask=None):
method merge_heads (line 312) | def merge_heads(self, x):
method split_heads (line 317) | def split_heads(self, x, k=False):
method forward (line 325) | def forward(self, x, head_mask=None):
class MLP (line 347) | class MLP(nn.Module):
method __init__ (line 348) | def __init__(self, n_state, config): # in MLP: n_state=3072 (4 * n_embd)
method forward (line 356) | def forward(self, x):
class Block (line 362) | class Block(nn.Module):
method __init__ (line 363) | def __init__(self, n_ctx, config, scale=False, output_attentions=False...
method forward (line 372) | def forward(self, x, head_mask=None):
class OpenAIGPTLMHead (line 384) | class OpenAIGPTLMHead(nn.Module):
method __init__ (line 387) | def __init__(self, model_embeddings_weights, config):
method set_embeddings_weights (line 396) | def set_embeddings_weights(self, model_embeddings_weights, predict_spe...
method forward (line 401) | def forward(self, hidden_state):
class OpenAIGPTMultipleChoiceHead (line 408) | class OpenAIGPTMultipleChoiceHead(nn.Module):
method __init__ (line 411) | def __init__(self, config):
method forward (line 420) | def forward(self, hidden_states, mc_token_ids):
class OpenAIGPTPreTrainedModel (line 434) | class OpenAIGPTPreTrainedModel(nn.Module):
method __init__ (line 439) | def __init__(self, config, *inputs, **kwargs):
method init_weights (line 451) | def init_weights(self, module):
method from_pretrained (line 465) | def from_pretrained(cls, pretrained_model_name_or_path, num_special_to...
class OpenAIGPTModel (line 610) | class OpenAIGPTModel(OpenAIGPTPreTrainedModel):
method __init__ (line 666) | def __init__(self, config, output_attentions=False, keep_multihead_out...
method set_num_special_tokens (line 678) | def set_num_special_tokens(self, num_special_tokens):
method prune_heads (line 692) | def prune_heads(self, heads_to_prune):
method get_multihead_outputs (line 699) | def get_multihead_outputs(self):
method forward (line 705) | def forward(self, input_ids, position_ids=None, token_type_ids=None, h...
class OpenAIGPTLMHeadModel (line 760) | class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
method __init__ (line 821) | def __init__(self, config, output_attentions=False, keep_multihead_out...
method set_num_special_tokens (line 828) | def set_num_special_tokens(self, num_special_tokens, predict_special_t...
method forward (line 836) | def forward(self, input_ids, position_ids=None, token_type_ids=None, l...
class OpenAIGPTDoubleHeadsModel (line 857) | class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
method __init__ (line 923) | def __init__(self, config, output_attentions=False, keep_multihead_out...
method set_num_special_tokens (line 931) | def set_num_special_tokens(self, num_special_tokens, predict_special_t...
method forward (line 939) | def forward(self, input_ids, mc_token_ids, lm_labels=None, mc_labels=N...
FILE: pytorch_pretrained_bert/modeling_transfo_xl.py
function build_tf_to_pytorch_map (line 53) | def build_tf_to_pytorch_map(model, config):
function load_tf_weights_in_transfo_xl (line 125) | def load_tf_weights_in_transfo_xl(model, config, tf_path):
class TransfoXLConfig (line 181) | class TransfoXLConfig(object):
method __init__ (line 184) | def __init__(self,
method from_dict (line 289) | def from_dict(cls, json_object):
method from_json_file (line 297) | def from_json_file(cls, json_file):
method __repr__ (line 303) | def __repr__(self):
method to_dict (line 306) | def to_dict(self):
method to_json_string (line 311) | def to_json_string(self):
method to_json_file (line 315) | def to_json_file(self, json_file_path):
class PositionalEmbedding (line 321) | class PositionalEmbedding(nn.Module):
method __init__ (line 322) | def __init__(self, demb):
method forward (line 330) | def forward(self, pos_seq, bsz=None):
class PositionwiseFF (line 340) | class PositionwiseFF(nn.Module):
method __init__ (line 341) | def __init__(self, d_model, d_inner, dropout, pre_lnorm=False):
method forward (line 359) | def forward(self, inp):
class MultiHeadAttn (line 375) | class MultiHeadAttn(nn.Module):
method __init__ (line 376) | def __init__(self, n_head, d_model, d_head, dropout, dropatt=0,
method forward (line 405) | def forward(self, h, attn_mask=None, mems=None):
class RelMultiHeadAttn (line 456) | class RelMultiHeadAttn(nn.Module):
method __init__ (line 457) | def __init__(self, n_head, d_model, d_head, dropout, dropatt=0,
method _parallelogram_mask (line 486) | def _parallelogram_mask(self, h, w, left=False):
method _shift (line 497) | def _shift(self, x, qlen, klen, mask, left=False):
method _rel_shift (line 515) | def _rel_shift(self, x, zero_triu=False):
method forward (line 531) | def forward(self, w, r, attn_mask=None, mems=None):
class RelPartialLearnableMultiHeadAttn (line 534) | class RelPartialLearnableMultiHeadAttn(RelMultiHeadAttn):
method __init__ (line 535) | def __init__(self, *args, **kwargs):
method forward (line 540) | def forward(self, w, r, attn_mask=None, mems=None):
class RelLearnableMultiHeadAttn (line 615) | class RelLearnableMultiHeadAttn(RelMultiHeadAttn):
method __init__ (line 616) | def __init__(self, *args, **kwargs):
method forward (line 619) | def forward(self, w, r_emb, r_w_bias, r_bias, attn_mask=None, mems=None):
class DecoderLayer (line 700) | class DecoderLayer(nn.Module):
method __init__ (line 701) | def __init__(self, n_head, d_model, d_head, d_inner, dropout, **kwargs):
method forward (line 708) | def forward(self, dec_inp, dec_attn_mask=None, mems=None):
class RelLearnableDecoderLayer (line 716) | class RelLearnableDecoderLayer(nn.Module):
method __init__ (line 717) | def __init__(self, n_head, d_model, d_head, d_inner, dropout,
method forward (line 726) | def forward(self, dec_inp, r_emb, r_w_bias, r_bias, dec_attn_mask=None...
class RelPartialLearnableDecoderLayer (line 735) | class RelPartialLearnableDecoderLayer(nn.Module):
method __init__ (line 736) | def __init__(self, n_head, d_model, d_head, d_inner, dropout,
method forward (line 745) | def forward(self, dec_inp, r, dec_attn_mask=None, mems=None):
class AdaptiveEmbedding (line 755) | class AdaptiveEmbedding(nn.Module):
method __init__ (line 756) | def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1,
method forward (line 786) | def forward(self, inp):
class TransfoXLPreTrainedModel (line 819) | class TransfoXLPreTrainedModel(nn.Module):
method __init__ (line 823) | def __init__(self, config, *inputs, **kwargs):
method init_weight (line 834) | def init_weight(self, weight):
method init_bias (line 840) | def init_bias(self, bias):
method init_weights (line 843) | def init_weights(self, m):
method set_num_special_tokens (line 884) | def set_num_special_tokens(self, num_special_tokens):
method from_pretrained (line 888) | def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwa...
class TransfoXLModel (line 1012) | class TransfoXLModel(TransfoXLPreTrainedModel):
method __init__ (line 1052) | def __init__(self, config):
method backward_compatible (line 1127) | def backward_compatible(self):
method reset_length (line 1131) | def reset_length(self, tgt_len, ext_len, mem_len):
method init_mems (line 1136) | def init_mems(self, data):
method _update_mems (line 1149) | def _update_mems(self, hids, mems, qlen, mlen):
method _forward (line 1172) | def _forward(self, dec_inp, mems=None):
method forward (line 1262) | def forward(self, input_ids, mems=None):
class TransfoXLLMHeadModel (line 1289) | class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
method __init__ (line 1339) | def __init__(self, config):
method tie_weights (line 1354) | def tie_weights(self):
method reset_length (line 1372) | def reset_length(self, tgt_len, ext_len, mem_len):
method init_mems (line 1375) | def init_mems(self, data):
method forward (line 1378) | def forward(self, input_ids, target=None, mems=None):
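Transformer-XL's segment-level recurrence lives in `_update_mems`: the cached hidden states and the new segment's hidden states are concatenated, and only the most recent `mem_len` positions are kept as the next memory. A simplified sketch on plain lists (the real method works per layer on tensors, detaches gradients, and honors an `ext_len` offset, all omitted here):

```python
def update_mems(mems, hids, mem_len):
    # Append the new segment's states, then keep only the most
    # recent `mem_len` positions as the memory for the next segment.
    cat = mems + hids
    return cat[-mem_len:] if mem_len > 0 else []

mems = ["h0", "h1", "h2"]
hids = ["h3", "h4"]
new_mems = update_mems(mems, hids, 3)
print(new_mems)  # ['h2', 'h3', 'h4']
```

This sliding window is what lets the model attend beyond the current segment without backpropagating through past segments.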
FILE: pytorch_pretrained_bert/modeling_transfo_xl_utilities.py
class ProjectedAdaptiveLogSoftmax (line 31) | class ProjectedAdaptiveLogSoftmax(nn.Module):
method __init__ (line 32) | def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1,
method _compute_logit (line 78) | def _compute_logit(self, hidden, weight, bias, proj):
method forward (line 92) | def forward(self, hidden, target=None, keep_order=False):
method log_prob (line 198) | def log_prob(self, hidden):
class LogUniformSampler (line 260) | class LogUniformSampler(object):
method __init__ (line 261) | def __init__(self, range_max, n_sample):
method sample (line 281) | def sample(self, labels):
function sample_logits (line 302) | def sample_logits(embedding, bias, labels, inputs, sampler):
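`LogUniformSampler` draws negative classes with probability proportional to how frequent they are assumed to be under a Zipfian ordering; the distribution matches TensorFlow's log-uniform candidate sampler, `P(k) = (log(k+2) - log(k+1)) / log(range_max + 1)` for 0-indexed class `k`. A sketch verifying the formula normalizes:

```python
import math

def log_uniform_prob(k, range_max):
    # Probability of drawing 0-indexed class k under the
    # log-uniform (Zipf-like) candidate sampling distribution.
    return (math.log(k + 2) - math.log(k + 1)) / math.log(range_max + 1)

# The log terms telescope, so the probabilities sum to 1 over the range.
total = sum(log_uniform_prob(k, 1000) for k in range(1000))
print(total)
```

Frequent (low-index) classes are sampled far more often than rare ones, which makes sampled softmax training efficient for large vocabularies.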
FILE: pytorch_pretrained_bert/optimization.py
class _LRSchedule (line 35) | class _LRSchedule(ABC):
method __init__ (line 38) | def __init__(self, warmup=0.002, t_total=-1, **kw):
method get_lr (line 53) | def get_lr(self, step, nowarn=False):
method get_lr_ (line 73) | def get_lr_(self, progress):
class ConstantLR (line 81) | class ConstantLR(_LRSchedule):
method get_lr_ (line 82) | def get_lr_(self, progress):
class WarmupCosineSchedule (line 86) | class WarmupCosineSchedule(_LRSchedule):
method __init__ (line 93) | def __init__(self, warmup=0.002, t_total=-1, cycles=.5, **kw):
method get_lr_ (line 103) | def get_lr_(self, progress):
class WarmupCosineWithHardRestartsSchedule (line 111) | class WarmupCosineWithHardRestartsSchedule(WarmupCosineSchedule):
method __init__ (line 117) | def __init__(self, warmup=0.002, t_total=-1, cycles=1., **kw):
method get_lr_ (line 121) | def get_lr_(self, progress):
class WarmupCosineWithWarmupRestartsSchedule (line 130) | class WarmupCosineWithWarmupRestartsSchedule(WarmupCosineWithHardRestart...
method __init__ (line 136) | def __init__(self, warmup=0.002, t_total=-1, cycles=1., **kw):
method get_lr_ (line 141) | def get_lr_(self, progress):
class WarmupConstantSchedule (line 151) | class WarmupConstantSchedule(_LRSchedule):
method get_lr_ (line 156) | def get_lr_(self, progress):
class WarmupLinearSchedule (line 162) | class WarmupLinearSchedule(_LRSchedule):
method get_lr_ (line 168) | def get_lr_(self, progress):
class BertAdam (line 183) | class BertAdam(Optimizer):
method __init__ (line 199) | def __init__(self, params, lr=required, warmup=-1, t_total=-1, schedul...
method get_lr (line 224) | def get_lr(self):
method step (line 236) | def step(self, closure=None):
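`WarmupLinearSchedule`, the default for `BertAdam`, ramps the learning-rate multiplier linearly from 0 to 1 over the warmup fraction of training, then decays it linearly back to 0. A scalar sketch of that piecewise schedule (`progress` is the fraction of total steps completed):

```python
def warmup_linear(progress, warmup=0.002):
    # Linear warmup to a multiplier of 1 at progress == warmup,
    # then linear decay to 0 at progress == 1 (end of training).
    if progress < warmup:
        return progress / warmup
    return max((progress - 1.0) / (warmup - 1.0), 0.0)

print(warmup_linear(0.0))    # 0.0 at the first step
print(warmup_linear(0.002))  # 1.0 at the end of warmup
print(warmup_linear(1.0))    # 0.0 at the last step
```

The optimizer multiplies its base `lr` by this factor each step; `_LRSchedule.get_lr` also warns (unless `nowarn=True`) when training runs past the declared `t_total`.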
FILE: pytorch_pretrained_bert/optimization_openai.py
class OpenAIAdam (line 29) | class OpenAIAdam(Optimizer):
method __init__ (line 32) | def __init__(self, params, lr=required, schedule='warmup_linear', warm...
method get_lr (line 58) | def get_lr(self):
method step (line 70) | def step(self, closure=None):
FILE: pytorch_pretrained_bert/tokenization.py
function load_vocab (line 50) | def load_vocab(vocab_file):
function whitespace_tokenize (line 65) | def whitespace_tokenize(text):
class BertTokenizer (line 74) | class BertTokenizer(object):
method __init__ (line 77) | def __init__(self, vocab_file, do_lower_case=True, max_len=None, do_ba...
method tokenize (line 107) | def tokenize(self, text, entity_pos=None):
method convert_tokens_to_ids (line 141) | def convert_tokens_to_ids(self, tokens):
method convert_ids_to_tokens (line 154) | def convert_ids_to_tokens(self, ids):
method save_vocabulary (line 161) | def save_vocabulary(self, vocab_path):
method from_pretrained (line 177) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
class BasicTokenizer (line 225) | class BasicTokenizer(object):
method __init__ (line 228) | def __init__(self,
method tokenize (line 239) | def tokenize(self, text):
method _run_strip_accents (line 260) | def _run_strip_accents(self, text):
method _run_split_on_punc (line 271) | def _run_split_on_punc(self, text):
method _tokenize_chinese_chars (line 293) | def _tokenize_chinese_chars(self, text):
method _is_chinese_char (line 306) | def _is_chinese_char(self, cp):
method _clean_text (line 328) | def _clean_text(self, text):
class WordpieceTokenizer (line 342) | class WordpieceTokenizer(object):
method __init__ (line 345) | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=...
method tokenize (line 350) | def tokenize(self, text):
function _is_whitespace (line 402) | def _is_whitespace(char):
function _is_control (line 414) | def _is_control(char):
function _is_punctuation (line 426) | def _is_punctuation(char):
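`WordpieceTokenizer.tokenize` (line 350) performs greedy longest-match-first subword splitting: repeatedly take the longest prefix found in the vocabulary, marking non-initial pieces with a `##` prefix. A self-contained sketch of that algorithm with a toy vocabulary — the real tokenizer runs on whitespace-pretokenized text and loads its vocabulary from `vocab_file`, so the names and vocab here are illustrative only:

```python
def wordpiece(word, vocab, unk="[UNK]", max_chars=100):
    """Greedy longest-match-first split; non-initial pieces get a '##' prefix."""
    if len(word) > max_chars:
        return [unk]
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1          # shrink the candidate until it is in the vocab
        if cur is None:
            return [unk]      # no prefix matched: the whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```

"unaffable" → `['un', '##aff', '##able']` is the classic example from the original BERT code; any character run with no matching piece collapses to `[UNK]`.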
FILE: pytorch_pretrained_bert/tokenization_gpt2.py
function lru_cache (line 31) | def lru_cache():
function bytes_to_unicode (line 54) | def bytes_to_unicode():
function get_pairs (line 76) | def get_pairs(word):
class GPT2Tokenizer (line 88) | class GPT2Tokenizer(object):
method from_pretrained (line 94) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
method __init__ (line 151) | def __init__(self, vocab_file, merges_file, errors='replace', special_...
method __len__ (line 170) | def __len__(self):
method set_special_tokens (line 173) | def set_special_tokens(self, special_tokens):
method bpe (line 186) | def bpe(self, token):
method tokenize (line 227) | def tokenize(self, text):
method convert_tokens_to_ids (line 238) | def convert_tokens_to_ids(self, tokens):
method convert_ids_to_tokens (line 259) | def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
method encode (line 270) | def encode(self, text):
method decode (line 273) | def decode(self, tokens, skip_special_tokens=False, clean_up_tokenizat...
method save_vocabulary (line 283) | def save_vocabulary(self, vocab_path):
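Both `GPT2Tokenizer` and `OpenAIGPTTokenizer` build on the same two pieces indexed here: `get_pairs` (the set of adjacent symbol pairs in a word) and a `bpe` loop that repeatedly merges the pair with the lowest rank in a learned merge table. A stripped-down sketch of that core — the rank table below is hypothetical, and the real `GPT2Tokenizer.bpe` additionally works on byte-mapped unicode (via `bytes_to_unicode`) and caches results:

```python
def get_pairs(word):
    """Set of adjacent symbol pairs in a tuple of symbols."""
    return {(word[i], word[i + 1]) for i in range(len(word) - 1)}

def bpe(token, ranks):
    """Merge the lowest-ranked adjacent pair until no pair is in `ranks`."""
    word = tuple(token)
    while len(word) > 1:
        best = min(get_pairs(word), key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break                 # nothing left to merge
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                merged.append(word[i] + word[i + 1])
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return list(word)

# hypothetical merge table: lower rank = higher merge priority
ranks = {("l", "o"): 0, ("lo", "w"): 1}
print(bpe("lower", ranks))  # ['low', 'e', 'r']
```

The loop is deterministic given the rank table, which is why both tokenizers can cache `bpe(token)` per token.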
FILE: pytorch_pretrained_bert/tokenization_openai.py
function get_pairs (line 46) | def get_pairs(word):
function text_standardize (line 58) | def text_standardize(text):
class OpenAIGPTTokenizer (line 73) | class OpenAIGPTTokenizer(object):
method from_pretrained (line 82) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
method __init__ (line 139) | def __init__(self, vocab_file, merges_file, special_tokens=None, max_l...
method __len__ (line 162) | def __len__(self):
method set_special_tokens (line 165) | def set_special_tokens(self, special_tokens):
method bpe (line 181) | def bpe(self, token):
method tokenize (line 224) | def tokenize(self, text):
method convert_tokens_to_ids (line 239) | def convert_tokens_to_ids(self, tokens):
method convert_ids_to_tokens (line 260) | def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
method encode (line 271) | def encode(self, text):
method decode (line 274) | def decode(self, ids, skip_special_tokens=False, clean_up_tokenization...
method save_vocabulary (line 285) | def save_vocabulary(self, vocab_path):
FILE: pytorch_pretrained_bert/tokenization_transfo_xl.py
class TransfoXLTokenizer (line 53) | class TransfoXLTokenizer(object):
method from_pretrained (line 58) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
method __init__ (line 101) | def __init__(self, special=[], min_freq=0, max_size=None, lower_case=F...
method count_file (line 112) | def count_file(self, path, verbose=False, add_eos=False):
method count_sents (line 127) | def count_sents(self, sents, verbose=False):
method _build_from_file (line 137) | def _build_from_file(self, vocab_file):
method save_vocabulary (line 152) | def save_vocabulary(self, vocab_path):
method build_vocab (line 160) | def build_vocab(self):
method encode_file (line 181) | def encode_file(self, path, ordered=False, verbose=False, add_eos=True,
method encode_sents (line 199) | def encode_sents(self, sents, ordered=False, verbose=False):
method add_special (line 212) | def add_special(self, sym):
method add_symbol (line 218) | def add_symbol(self, sym):
method get_sym (line 223) | def get_sym(self, idx):
method get_idx (line 227) | def get_idx(self, sym):
method convert_ids_to_tokens (line 243) | def convert_ids_to_tokens(self, indices):
method convert_tokens_to_ids (line 247) | def convert_tokens_to_ids(self, symbols):
method convert_to_tensor (line 251) | def convert_to_tensor(self, symbols):
method decode (line 254) | def decode(self, indices, exclude=None):
method __len__ (line 261) | def __len__(self):
method tokenize (line 264) | def tokenize(self, line, add_eos=False, add_double_eos=False):
class LMOrderedIterator (line 284) | class LMOrderedIterator(object):
method __init__ (line 285) | def __init__(self, data, bsz, bptt, device='cpu', ext_len=None):
method get_batch (line 307) | def get_batch(self, i, bptt=None):
method get_fixlen_iter (line 322) | def get_fixlen_iter(self, start=0):
method get_varlen_iter (line 326) | def get_varlen_iter(self, start=0, std=5, min_len=5, max_deviation=3):
method __iter__ (line 338) | def __iter__(self):
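`LMOrderedIterator` turns one long token stream into `bsz` contiguous streams and walks them in chunks of `bptt`, with targets shifted one position ahead of inputs. A sketch of that batching logic on plain Python lists — the real class works on tensors (effectively `data.view(bsz, -1).t()`), supports `ext_len` and variable-length iteration, so this captures only the core layout:

```python
def lm_ordered_batches(data, bsz, bptt):
    """Yield (inputs, targets) chunks; each of the bsz streams is contiguous,
    and targets are the inputs shifted by one token."""
    n_step = len(data) // bsz
    data = data[: n_step * bsz]   # drop the ragged tail that won't fill a step
    streams = [data[k * n_step:(k + 1) * n_step] for k in range(bsz)]
    for i in range(0, n_step - 1, bptt):
        seq_len = min(bptt, n_step - 1 - i)   # last chunk may be shorter
        inputs = [s[i:i + seq_len] for s in streams]
        targets = [s[i + 1:i + 1 + seq_len] for s in streams]
        yield inputs, targets

batches = list(lm_ordered_batches(list(range(10)), bsz=2, bptt=3))
# first batch: inputs [[0, 1, 2], [5, 6, 7]], targets [[1, 2, 3], [6, 7, 8]]
```

Keeping each stream contiguous is what lets Transformer-XL carry memory across batches: position `i + bptt` in a stream really does follow position `i`.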
class LMShuffledIterator (line 342) | class LMShuffledIterator(object):
method __init__ (line 343) | def __init__(self, data, bsz, bptt, device='cpu', ext_len=None, shuffl...
method get_sent_stream (line 356) | def get_sent_stream(self):
method stream_iterator (line 365) | def stream_iterator(self, sent_stream):
method __iter__ (line 414) | def __iter__(self):
class LMMultiFileIterator (line 422) | class LMMultiFileIterator(LMShuffledIterator):
method __init__ (line 423) | def __init__(self, paths, vocab, bsz, bptt, device='cpu', ext_len=None,
method get_sent_stream (line 436) | def get_sent_stream(self, path):
method __iter__ (line 444) | def __iter__(self):
class TransfoXLCorpus (line 455) | class TransfoXLCorpus(object):
method from_pretrained (line 457) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
method __init__ (line 499) | def __init__(self, *args, **kwargs):
method build_corpus (line 506) | def build_corpus(self, path, dataset):
method get_iterator (line 545) | def get_iterator(self, split, *args, **kwargs):
function get_lm_corpus (line 562) | def get_lm_corpus(datadir, dataset):
FILE: tests/conftest.py
function pytest_addoption (line 6) | def pytest_addoption(parser):
function pytest_collection_modifyitems (line 12) | def pytest_collection_modifyitems(config, items):
FILE: tests/modeling_gpt2_test.py
class GPT2ModelTest (line 32) | class GPT2ModelTest(unittest.TestCase):
class GPT2ModelTester (line 33) | class GPT2ModelTester(object):
method __init__ (line 35) | def __init__(self,
method prepare_config_and_inputs (line 73) | def prepare_config_and_inputs(self):
method create_gpt2_model (line 106) | def create_gpt2_model(self, config, input_ids, token_type_ids, posit...
method check_gpt2_model_output (line 117) | def check_gpt2_model_output(self, result):
method create_gpt2_lm_head (line 124) | def create_gpt2_lm_head(self, config, input_ids, token_type_ids, pos...
method create_gpt2_lm_head_with_output_attention (line 137) | def create_gpt2_lm_head_with_output_attention(self, config, input_id...
method check_gpt2_lm_head_output (line 151) | def check_gpt2_lm_head_output(self, result):
method check_gpt2_lm_head_loss_output (line 161) | def check_gpt2_lm_head_loss_output(self, result):
method create_gpt2_double_heads (line 166) | def create_gpt2_double_heads(self, config, input_ids, token_type_ids...
method create_gpt2_double_heads_with_output_attention (line 182) | def create_gpt2_double_heads_with_output_attention(self, config, inp...
method check_gpt2_double_heads_output (line 199) | def check_gpt2_double_heads_output(self, result):
method check_gpt2_double_heads_loss_output (line 208) | def check_gpt2_double_heads_loss_output(self, result):
method create_and_check_gpt2_for_headmasking (line 213) | def create_and_check_gpt2_for_headmasking(self, config, input_ids, t...
method create_and_check_gpt2_for_head_pruning (line 268) | def create_and_check_gpt2_for_head_pruning(self, config, input_ids, ...
method test_default (line 305) | def test_default(self):
method test_config_to_json_string (line 308) | def test_config_to_json_string(self):
method test_config_to_json_file (line 314) | def test_config_to_json_file(self):
method test_model_from_pretrained (line 323) | def test_model_from_pretrained(self):
method run_tester (line 330) | def run_tester(self, tester):
method ids_tensor (line 347) | def ids_tensor(cls, shape, vocab_size, rng=None, name=None):
FILE: tests/modeling_openai_test.py
class OpenAIGPTModelTest (line 32) | class OpenAIGPTModelTest(unittest.TestCase):
class OpenAIGPTModelTester (line 33) | class OpenAIGPTModelTester(object):
method __init__ (line 35) | def __init__(self,
method prepare_config_and_inputs (line 81) | def prepare_config_and_inputs(self):
method create_openai_model (line 117) | def create_openai_model(self, config, input_ids, token_type_ids, pos...
method check_openai_model_output (line 127) | def check_openai_model_output(self, result):
method create_openai_lm_head (line 134) | def create_openai_lm_head(self, config, input_ids, token_type_ids, p...
method check_openai_lm_head_output (line 146) | def check_openai_lm_head_output(self, result):
method check_openai_lm_head_loss_output (line 152) | def check_openai_lm_head_loss_output(self, result):
method create_openai_double_heads (line 157) | def create_openai_double_heads(self, config, input_ids, token_type_i...
method check_openai_double_heads_output (line 172) | def check_openai_double_heads_output(self, result):
method check_openai_double_heads_loss_output (line 181) | def check_openai_double_heads_loss_output(self, result):
method create_and_check_openai_for_headmasking (line 186) | def create_and_check_openai_for_headmasking(self, config, input_ids,...
method create_and_check_openai_for_head_pruning (line 242) | def create_and_check_openai_for_head_pruning(self, config, input_ids...
method test_default (line 279) | def test_default(self):
method test_config_to_json_string (line 282) | def test_config_to_json_string(self):
method test_config_to_json_file (line 288) | def test_config_to_json_file(self):
method test_model_from_pretrained (line 297) | def test_model_from_pretrained(self):
method run_tester (line 304) | def run_tester(self, tester):
method ids_tensor (line 321) | def ids_tensor(cls, shape, vocab_size, rng=None, name=None):
FILE: tests/modeling_test.py
class BertModelTest (line 35) | class BertModelTest(unittest.TestCase):
class BertModelTester (line 36) | class BertModelTester(object):
method __init__ (line 38) | def __init__(self,
method prepare_config_and_inputs (line 84) | def prepare_config_and_inputs(self):
method check_loss_output (line 118) | def check_loss_output(self, result):
method create_bert_model (line 123) | def create_bert_model(self, config, input_ids, token_type_ids, input...
method check_bert_model_output (line 134) | def check_bert_model_output(self, result):
method create_bert_for_masked_lm (line 144) | def create_bert_for_masked_lm(self, config, input_ids, token_type_id...
method check_bert_for_masked_lm_output (line 155) | def check_bert_for_masked_lm_output(self, result):
method create_bert_for_next_sequence_prediction (line 160) | def create_bert_for_next_sequence_prediction(self, config, input_ids...
method check_bert_for_next_sequence_prediction_output (line 171) | def check_bert_for_next_sequence_prediction_output(self, result):
method create_bert_for_pretraining (line 177) | def create_bert_for_pretraining(self, config, input_ids, token_type_...
method check_bert_for_pretraining_output (line 189) | def check_bert_for_pretraining_output(self, result):
method create_bert_for_question_answering (line 198) | def create_bert_for_question_answering(self, config, input_ids, toke...
method check_bert_for_question_answering_output (line 210) | def check_bert_for_question_answering_output(self, result):
method create_bert_for_sequence_classification (line 219) | def create_bert_for_sequence_classification(self, config, input_ids,...
method check_bert_for_sequence_classification_output (line 230) | def check_bert_for_sequence_classification_output(self, result):
method create_bert_for_token_classification (line 236) | def create_bert_for_token_classification(self, config, input_ids, to...
method check_bert_for_token_classification_output (line 247) | def check_bert_for_token_classification_output(self, result):
method create_bert_for_multiple_choice (line 253) | def create_bert_for_multiple_choice(self, config, input_ids, token_t...
method check_bert_for_multiple_choice (line 272) | def check_bert_for_multiple_choice(self, result):
method create_and_check_bert_for_attentions (line 278) | def create_and_check_bert_for_attentions(self, config, input_ids, to...
method create_and_check_bert_for_headmasking (line 296) | def create_and_check_bert_for_headmasking(self, config, input_ids, t...
method create_and_check_bert_for_head_pruning (line 356) | def create_and_check_bert_for_head_pruning(self, config, input_ids, ...
method test_default (line 397) | def test_default(self):
method test_config_to_json_string (line 400) | def test_config_to_json_string(self):
method test_config_to_json_file (line 406) | def test_config_to_json_file(self):
method test_model_from_pretrained (line 415) | def test_model_from_pretrained(self):
method run_tester (line 422) | def run_tester(self, tester):
method ids_tensor (line 460) | def ids_tensor(cls, shape, vocab_size, rng=None, name=None):
FILE: tests/modeling_transfo_xl_test.py
class TransfoXLModelTest (line 31) | class TransfoXLModelTest(unittest.TestCase):
class TransfoXLModelTester (line 32) | class TransfoXLModelTester(object):
method __init__ (line 34) | def __init__(self,
method prepare_config_and_inputs (line 72) | def prepare_config_and_inputs(self):
method set_seed (line 95) | def set_seed(self):
method create_transfo_xl_model (line 99) | def create_transfo_xl_model(self, config, input_ids_1, input_ids_2, ...
method check_transfo_xl_model_output (line 113) | def check_transfo_xl_model_output(self, result):
method create_transfo_xl_lm_head (line 128) | def create_transfo_xl_lm_head(self, config, input_ids_1, input_ids_2...
method check_transfo_xl_lm_head_output (line 150) | def check_transfo_xl_lm_head_output(self, result):
method test_default (line 183) | def test_default(self):
method test_config_to_json_string (line 186) | def test_config_to_json_string(self):
method test_config_to_json_file (line 192) | def test_config_to_json_file(self):
method test_model_from_pretrained (line 201) | def test_model_from_pretrained(self):
method run_tester (line 208) | def run_tester(self, tester):
method ids_tensor (line 220) | def ids_tensor(cls, shape, vocab_size, rng=None, name=None):
FILE: tests/optimization_test.py
class OptimizationTest (line 30) | class OptimizationTest(unittest.TestCase):
method assertListAlmostEqual (line 32) | def assertListAlmostEqual(self, list1, list2, tol):
method test_adam (line 37) | def test_adam(self):
class ScheduleInitTest (line 54) | class ScheduleInitTest(unittest.TestCase):
method test_bert_sched_init (line 55) | def test_bert_sched_init(self):
method test_openai_sched_init (line 65) | def test_openai_sched_init(self):
class WarmupCosineWithRestartsTest (line 76) | class WarmupCosineWithRestartsTest(unittest.TestCase):
method test_it (line 77) | def test_it(self):
FILE: tests/tokenization_gpt2_test.py
class GPT2TokenizationTest (line 26) | class GPT2TokenizationTest(unittest.TestCase):
method test_full_tokenizer (line 28) | def test_full_tokenizer(self):
method test_tokenizer_from_pretrained (line 69) | def test_tokenizer_from_pretrained(self):
FILE: tests/tokenization_openai_test.py
class OpenAIGPTTokenizationTest (line 26) | class OpenAIGPTTokenizationTest(unittest.TestCase):
method test_full_tokenizer (line 28) | def test_full_tokenizer(self):
method test_tokenizer_from_pretrained (line 70) | def test_tokenizer_from_pretrained(self):
FILE: tests/tokenization_test.py
class TokenizationTest (line 30) | class TokenizationTest(unittest.TestCase):
method test_full_tokenizer (line 32) | def test_full_tokenizer(self):
method test_tokenizer_from_pretrained (line 62) | def test_tokenizer_from_pretrained(self):
method test_chinese (line 69) | def test_chinese(self):
method test_basic_tokenizer_lower (line 76) | def test_basic_tokenizer_lower(self):
method test_basic_tokenizer_no_lower (line 84) | def test_basic_tokenizer_no_lower(self):
method test_wordpiece_tokenizer (line 91) | def test_wordpiece_tokenizer(self):
method test_is_whitespace (line 111) | def test_is_whitespace(self):
method test_is_control (line 121) | def test_is_control(self):
method test_is_punctuation (line 129) | def test_is_punctuation(self):
FILE: tests/tokenization_transfo_xl_test.py
class TransfoXLTokenizationTest (line 26) | class TransfoXLTokenizationTest(unittest.TestCase):
method test_full_tokenizer (line 28) | def test_full_tokenizer(self):
method test_full_tokenizer_lower (line 57) | def test_full_tokenizer_lower(self):
method test_full_tokenizer_no_lower (line 64) | def test_full_tokenizer_no_lower(self):
method test_tokenizer_from_pretrained (line 72) | def test_tokenizer_from_pretrained(self):
Condensed preview — 65 files, each showing path, character count, and a content snippet.
[
{
"path": "LICENSE",
"chars": 11358,
"preview": "\n Apache License\n Version 2.0, January 2004\n "
},
{
"path": "MANIFEST.in",
"chars": 16,
"preview": "include LICENSE\n"
},
{
"path": "README.md",
"chars": 1870,
"preview": "### 实现说明\n\n主要实现文章前半部分的工作,PyTorch实现,基于[huggingface](https://github.com/huggingface/pytorch-pretrained-BERT)的工作,PyTorch才是世界"
},
{
"path": "docker/Dockerfile",
"chars": 197,
"preview": "FROM pytorch/pytorch:latest\n\nRUN git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cu"
},
{
"path": "examples/bertology.py",
"chars": 17149,
"preview": "#!/usr/bin/env python3\nimport os\nimport argparse\nimport logging\nfrom datetime import timedelta, datetime\nfrom tqdm impor"
},
{
"path": "examples/extract_features.py",
"chars": 12208,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under th"
},
{
"path": "examples/lm_finetuning/README.md",
"chars": 6210,
"preview": "# BERT Model Finetuning using Masked Language Modeling objective\n\n## Introduction\n\nThe three example scripts in this fol"
},
{
"path": "examples/lm_finetuning/finetune_on_pregenerated.py",
"chars": 16453,
"preview": "from argparse import ArgumentParser\nfrom pathlib import Path\nimport os\nimport torch\nimport logging\nimport json\nimport ra"
},
{
"path": "examples/lm_finetuning/pregenerate_training_data.py",
"chars": 16270,
"preview": "from argparse import ArgumentParser\nfrom pathlib import Path\nfrom tqdm import tqdm, trange\nfrom tempfile import Temporar"
},
{
"path": "examples/lm_finetuning/simple_lm_finetuning.py",
"chars": 28381,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
},
{
"path": "examples/run_classifier.py",
"chars": 51160,
"preview": "#coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, "
},
{
"path": "examples/run_classifier_dataset_utils.py",
"chars": 19787,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
},
{
"path": "examples/run_gpt2.py",
"chars": 5222,
"preview": "#!/usr/bin/env python3\n\nimport argparse\nimport logging\nfrom tqdm import trange\n\nimport torch\nimport torch.nn.functional "
},
{
"path": "examples/run_openai_gpt.py",
"chars": 13653,
"preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
},
{
"path": "examples/run_squad.py",
"chars": 21799,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
},
{
"path": "examples/run_squad_dataset_utils.py",
"chars": 30976,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
},
{
"path": "examples/run_swag.py",
"chars": 24323,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
},
{
"path": "examples/run_transfo_xl.py",
"chars": 6735,
"preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
},
{
"path": "examples/sem_run_classifier.py",
"chars": 51160,
"preview": "#coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, "
},
{
"path": "examples/tacred_run_classifier.py",
"chars": 50886,
"preview": "#coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, "
},
{
"path": "examples/tacred_run_infer.py",
"chars": 23695,
"preview": "from __future__ import absolute_import, division, print_function\n\nimport argparse\nimport csv\nimport logging\nimport os\nim"
},
{
"path": "examples/test.sh",
"chars": 492,
"preview": "#export GLUE_DIR=/data/share/zhanghaipeng/pytorch-pretrained-BERT/examples/general_ner_test\nexport GLUE_DIR=/data/share/"
},
{
"path": "examples/train.sh",
"chars": 461,
"preview": "export GLUE_DIR=/data/share/zhanghaipeng/tre/datasets/data\nexport TASK_NAME=tacred\n\nEXPR=25\nBS=16\nCUDA=2\nLR=3e-5\nEPOCH=4"
},
{
"path": "hubconf.py",
"chars": 723,
"preview": "dependencies = ['torch', 'tqdm', 'boto3', 'requests', 'regex']\n\nfrom hubconfs.bert_hubconf import (\n bertTokenizer,\n "
},
{
"path": "hubconfs/bert_hubconf.py",
"chars": 17306,
"preview": "from pytorch_pretrained_bert.tokenization import BertTokenizer\nfrom pytorch_pretrained_bert.modeling import (\n Be"
},
{
"path": "hubconfs/gpt2_hubconf.py",
"chars": 7052,
"preview": "from pytorch_pretrained_bert.tokenization_gpt2 import GPT2Tokenizer\nfrom pytorch_pretrained_bert.modeling_gpt2 import (\n"
},
{
"path": "hubconfs/gpt_hubconf.py",
"chars": 8281,
"preview": "from pytorch_pretrained_bert.tokenization_openai import OpenAIGPTTokenizer\nfrom pytorch_pretrained_bert.modeling_openai "
},
{
"path": "hubconfs/transformer_xl_hubconf.py",
"chars": 5856,
"preview": "from pytorch_pretrained_bert.tokenization_transfo_xl import TransfoXLTokenizer\nfrom pytorch_pretrained_bert.modeling_tra"
},
{
"path": "notebooks/Comparing-PT-and-TF-models.ipynb",
"chars": 92238,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Pytorch to Tensorflow Conversion "
},
{
"path": "notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb",
"chars": 173162,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Comparing TensorFlow (original) a"
},
{
"path": "notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb",
"chars": 207537,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Comparing TensorFlow (original) a"
},
{
"path": "notebooks/Comparing-TF-and-PT-models.ipynb",
"chars": 62623,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Comparing TensorFlow (original) a"
},
{
"path": "pytorch_pretrained_bert/__init__.py",
"chars": 1337,
"preview": "__version__ = \"0.6.2\"\nfrom .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer\nfrom .tokenization_ope"
},
{
"path": "pytorch_pretrained_bert/__main__.py",
"chars": 4393,
"preview": "# coding: utf8\ndef main():\n import sys\n if (len(sys.argv) != 4 and len(sys.argv) != 5) or sys.argv[1] not in [\n "
},
{
"path": "pytorch_pretrained_bert/convert_gpt2_checkpoint_to_pytorch.py",
"chars": 3017,
"preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
},
{
"path": "pytorch_pretrained_bert/convert_openai_checkpoint_to_pytorch.py",
"chars": 3106,
"preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
},
{
"path": "pytorch_pretrained_bert/convert_pytorch_checkpoint_to_tf.py",
"chars": 4343,
"preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
},
{
"path": "pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py",
"chars": 2593,
"preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
},
{
"path": "pytorch_pretrained_bert/convert_transfo_xl_checkpoint_to_pytorch.py",
"chars": 5671,
"preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
},
{
"path": "pytorch_pretrained_bert/file_utils.py",
"chars": 9347,
"preview": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github"
},
{
"path": "pytorch_pretrained_bert/modeling.py",
"chars": 66537,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
},
{
"path": "pytorch_pretrained_bert/modeling_gpt2.py",
"chars": 45614,
"preview": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORAT"
},
{
"path": "pytorch_pretrained_bert/modeling_openai.py",
"chars": 46459,
"preview": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORAT"
},
{
"path": "pytorch_pretrained_bert/modeling_transfo_xl.py",
"chars": 60075,
"preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
},
{
"path": "pytorch_pretrained_bert/modeling_transfo_xl_utilities.py",
"chars": 16108,
"preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
},
{
"path": "pytorch_pretrained_bert/optimization.py",
"chars": 13047,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under th"
},
{
"path": "pytorch_pretrained_bert/optimization_openai.py",
"chars": 5558,
"preview": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache Li"
},
{
"path": "pytorch_pretrained_bert/tokenization.py",
"chars": 18201,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under th"
},
{
"path": "pytorch_pretrained_bert/tokenization_gpt2.py",
"chars": 14181,
"preview": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache Li"
},
{
"path": "pytorch_pretrained_bert/tokenization_openai.py",
"chars": 14189,
"preview": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache Li"
},
{
"path": "pytorch_pretrained_bert/tokenization_transfo_xl.py",
"chars": 22339,
"preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
},
{
"path": "requirements.txt",
"chars": 196,
"preview": "# PyTorch\ntorch>=0.4.1\n# progress bars in model download and training scripts\ntqdm\n# Accessing files from S3 directly.\nb"
},
{
"path": "samples/input.txt",
"chars": 52,
"preview": "Who was Jim Henson ? ||| Jim Henson was a puppeteer\n"
},
{
"path": "samples/sample_text.txt",
"chars": 4364,
"preview": "This text is included to make sure Unicode is handled properly: 力加勝北区ᴵᴺᵀᵃছজটডণত\nText should be one-sentence-per-line, wi"
},
{
"path": "setup.py",
"chars": 2798,
"preview": "\"\"\"\nSimple check list from AllenNLP repo: https://github.com/allenai/allennlp/blob/master/setup.py\n\nTo create the packag"
},
{
"path": "tests/conftest.py",
"chars": 511,
"preview": "# content of conftest.py\n\nimport pytest\n\n\ndef pytest_addoption(parser):\n parser.addoption(\n \"--runslow\", actio"
},
{
"path": "tests/modeling_gpt2_test.py",
"chars": 16770,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "tests/modeling_openai_test.py",
"chars": 15409,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "tests/modeling_test.py",
"chars": 23337,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "tests/modeling_transfo_xl_test.py",
"chars": 9474,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "tests/optimization_test.py",
"chars": 3927,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "tests/tokenization_gpt2_test.py",
"chars": 3124,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "tests/tokenization_openai_test.py",
"chars": 3222,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "tests/tokenization_test.py",
"chars": 5090,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
},
{
"path": "tests/tokenization_transfo_xl_test.py",
"chars": 2998,
"preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
}
]
About this extraction
This document contains the full source code of the zhpmatrix/BERTem GitHub repository, extracted and formatted as plain text: 65 files (1.4 MB, approximately 420.2k tokens) and a symbol index of 925 extracted functions, classes, methods, constants, and types. Extracted by GitExtract.