Full Code of zhpmatrix/BERTem for AI

master 5151c4c304d1 cached
65 files
1.4 MB
420.2k tokens
925 symbols
1 requests
Download .txt
Showing preview only (1,445K chars total). Download the full file or copy to clipboard to get everything.
Repository: zhpmatrix/BERTem
Branch: master
Commit: 5151c4c304d1
Files: 65
Total size: 1.4 MB

Directory structure:
gitextract_mjpetdbh/

├── LICENSE
├── MANIFEST.in
├── README.md
├── docker/
│   └── Dockerfile
├── examples/
│   ├── bertology.py
│   ├── extract_features.py
│   ├── lm_finetuning/
│   │   ├── README.md
│   │   ├── finetune_on_pregenerated.py
│   │   ├── pregenerate_training_data.py
│   │   └── simple_lm_finetuning.py
│   ├── run_classifier.py
│   ├── run_classifier_dataset_utils.py
│   ├── run_gpt2.py
│   ├── run_openai_gpt.py
│   ├── run_squad.py
│   ├── run_squad_dataset_utils.py
│   ├── run_swag.py
│   ├── run_transfo_xl.py
│   ├── sem_run_classifier.py
│   ├── tacred_run_classifier.py
│   ├── tacred_run_infer.py
│   ├── test.sh
│   └── train.sh
├── hubconf.py
├── hubconfs/
│   ├── bert_hubconf.py
│   ├── gpt2_hubconf.py
│   ├── gpt_hubconf.py
│   └── transformer_xl_hubconf.py
├── notebooks/
│   ├── Comparing-PT-and-TF-models.ipynb
│   ├── Comparing-TF-and-PT-models-MLM-NSP.ipynb
│   ├── Comparing-TF-and-PT-models-SQuAD.ipynb
│   └── Comparing-TF-and-PT-models.ipynb
├── pytorch_pretrained_bert/
│   ├── __init__.py
│   ├── __main__.py
│   ├── convert_gpt2_checkpoint_to_pytorch.py
│   ├── convert_openai_checkpoint_to_pytorch.py
│   ├── convert_pytorch_checkpoint_to_tf.py
│   ├── convert_tf_checkpoint_to_pytorch.py
│   ├── convert_transfo_xl_checkpoint_to_pytorch.py
│   ├── file_utils.py
│   ├── modeling.py
│   ├── modeling_gpt2.py
│   ├── modeling_openai.py
│   ├── modeling_transfo_xl.py
│   ├── modeling_transfo_xl_utilities.py
│   ├── optimization.py
│   ├── optimization_openai.py
│   ├── tokenization.py
│   ├── tokenization_gpt2.py
│   ├── tokenization_openai.py
│   └── tokenization_transfo_xl.py
├── requirements.txt
├── samples/
│   ├── input.txt
│   └── sample_text.txt
├── setup.py
└── tests/
    ├── conftest.py
    ├── modeling_gpt2_test.py
    ├── modeling_openai_test.py
    ├── modeling_test.py
    ├── modeling_transfo_xl_test.py
    ├── optimization_test.py
    ├── tokenization_gpt2_test.py
    ├── tokenization_openai_test.py
    ├── tokenization_test.py
    └── tokenization_transfo_xl_test.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: MANIFEST.in
================================================
include LICENSE


================================================
FILE: README.md
================================================
### 实现说明

主要实现文章前半部分的工作,PyTorch实现,基于[huggingface](https://github.com/huggingface/pytorch-pretrained-BERT)的工作,PyTorch才是世界上最屌的框架,逃。

### 实现参考

![img1](http://wx2.sinaimg.cn/mw690/aba7d18bgy1g47p0g5ln3j210n0drtas.jpg)


### 代码说明

(1)主要修改:[modeling.py](https://github.com/zhpmatrix/BERTem/blob/master/pytorch_pretrained_bert/modeling.py)

output representation: **BertForSequenceClassification**

input representation:  **BertEmbeddings**

input和output都实现了多种策略,可以结合具体的任务,找到最佳的组合。


(2)非主要实现:examples下的关于classification的文件

(3)服务部署:基于Flask,可以在本地开启一个服务。具体实现在[tacred\_run\_infer.py](https://github.com/zhpmatrix/BERTem/blob/master/examples/tacred_run_infer.py)中。

(4)代码仅供参考,不提供数据集,不提供预训练模型,不提供训练后的模型(希望理解吧)。

(5)相关工作可以参考[我的博客-神经关系抽取](https://zhpmatrix.github.io/2019/06/30/neural-relation-extraction/),可能比这个代码更有价值一些吧。


### 实现结果:

 数据集TACRED上的结果:

|模型序号|输入类型|输出类型|指标类型|P|R|F1|备注|
|------|------|------|------|------|------|------|------|
|0|entity marker|sum(entity start)|micro|**0.68**|**0.63**|**0.65**|**base-model**,lr=3e-5,epoch=3|
||||macro|**0.60**|**0.54**|**0.55**|
|1|entity marker|sum(entity start)|micro|**0.70**|**0.62**|**0.65**|**large-model**,lr=3e-5,epoch=1|
||||macro|**0.63**|**0.52**|**0.55**|
|-1|None|None|micro|**0.69**|**0.66**|**0.67**|手误之后,再也找不到了,尴尬|||
||||macro|**0.58**|**0.50**|**0.53**||||


数据集SemEval2010 Task 8上的结果:

|模型序号|输入类型|输出类型|指标类型|P|R|F1|备注|
|------|------|------|------|------|------|------|------|
|0|entity marker|maxpool(entity emb)+relu|micro|**0.86**|**0.86**|**0.86**|bert-large|
||||macro|**0.82**|**0.83**|**0.82**||||


### 混合精度加速结果

在具体任务上,延续之前的setting,将train和dev合并共同作为新的train集,test集不变。在fp32
和fp16的两种setting下,比较相同batch\_size下,一个epoch的用时或者每个迭代的用时。

|比较方面|fp32|fp16|备注|
|------|------|------|------|
|训练阶段|1.04it/s|4.41it/s|12.76it/s(独占显卡)|
|推断阶段|4.14it/s|8.63it/s||
|测试集指标|0.65/0.55|0.64/0.53|格式:micro/macro|
|模型大小|421M|212M||


================================================
FILE: docker/Dockerfile
================================================
FROM pytorch/pytorch:latest

RUN git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext

RUN pip install pytorch-pretrained-bert

WORKDIR /workspace

================================================
FILE: examples/bertology.py
================================================
#!/usr/bin/env python3
import os
import argparse
import logging
from datetime import timedelta, datetime
from tqdm import tqdm

import numpy as np

import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset, Subset
from torch.utils.data.distributed import DistributedSampler
from torch.nn import CrossEntropyLoss, MSELoss

from pytorch_pretrained_bert import BertForSequenceClassification, BertTokenizer

from run_classifier_dataset_utils import processors, output_modes, convert_examples_to_features, compute_metrics


logger = logging.getLogger(__name__)


def entropy(p):
    plogp = p * torch.log(p)
    plogp[p == 0] = 0
    return -plogp.sum(dim=-1)


def print_1d_tensor(tensor, prefix=""):
    if tensor.dtype != torch.long:
        logger.info(prefix + "\t".join(f"{x:.5f}" for x in tensor.cpu().data))
    else:
        logger.info(prefix + "\t".join(f"{x:d}" for x in tensor.cpu().data))


def print_2d_tensor(tensor):
    logger.info("lv, h >\t" + "\t".join(f"{x + 1}" for x in range(len(tensor))))
    for row in range(len(tensor)):
        print_1d_tensor(tensor[row], prefix=f"layer {row + 1}:\t")


def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None):
    """ Example on how to use model outputs to compute:
        - head attention entropy (activated by setting output_attentions=True when we created the model
        - head importance scores according to http://arxiv.org/abs/1905.10650
            (activated by setting keep_multihead_output=True when we created the model)
    """
    # Prepare our tensors
    n_layers, n_heads = model.bert.config.num_hidden_layers, model.bert.config.num_attention_heads
    head_importance = torch.zeros(n_layers, n_heads).to(args.device)
    attn_entropy = torch.zeros(n_layers, n_heads).to(args.device)
    preds = None
    labels = None
    tot_tokens = 0.0

    for step, batch in enumerate(tqdm(eval_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])):
        batch = tuple(t.to(args.device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        # Do a forward pass (not with torch.no_grad() since we need gradients for importance score - see below)
        all_attentions, logits = model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask, head_mask=head_mask)

        if compute_entropy:
            # Update head attention entropy
            for layer, attn in enumerate(all_attentions):
                masked_entropy = entropy(attn.detach()) * input_mask.float().unsqueeze(1)
                attn_entropy[layer] += masked_entropy.sum(-1).sum(0).detach()

        if compute_importance:
            # Update head importance scores with regards to our loss
            # First, backpropagate to populate the gradients
            if args.output_mode == "classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, args.num_labels), label_ids.view(-1))
            elif args.output_mode == "regression":
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), label_ids.view(-1))
            loss.backward()
            # Second, compute importance scores according to http://arxiv.org/abs/1905.10650
            multihead_outputs = model.bert.get_multihead_outputs()
            for layer, mh_layer_output in enumerate(multihead_outputs):
                dot = torch.einsum("bhli,bhli->bhl", [mh_layer_output.grad, mh_layer_output])
                head_importance[layer] += dot.abs().sum(-1).sum(0).detach()

        # Also store our logits/labels if we want to compute metrics afterwards
        if preds is None:
            preds = logits.detach().cpu().numpy()
            labels = label_ids.detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            labels = np.append(labels, label_ids.detach().cpu().numpy(), axis=0)

        tot_tokens += input_mask.float().detach().sum().data

    # Normalize
    attn_entropy /= tot_tokens
    head_importance /= tot_tokens
    # Layerwise importance normalization
    if not args.dont_normalize_importance_by_layer:
        exponent = 2
        norm_by_layer = torch.pow(torch.pow(head_importance, exponent).sum(-1), 1/exponent)
        head_importance /= norm_by_layer.unsqueeze(-1) + 1e-20

    if not args.dont_normalize_global_importance:
        head_importance = (head_importance - head_importance.min()) / (head_importance.max() - head_importance.min())

    return attn_entropy, head_importance, preds, labels


def run_model():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default='bert-base-cased-finetuned-mrpc', help='pretrained model name or path to local checkpoint')
    parser.add_argument("--task_name", type=str, default='mrpc', help="The name of the task to train.")
    parser.add_argument("--data_dir", type=str, required=True, help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
    parser.add_argument("--output_dir", type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.")
    parser.add_argument("--data_subset", type=int, default=-1, help="If > 0: limit the data to a subset of data_subset instances.")
    parser.add_argument("--overwrite_output_dir", action='store_true', help="Whether to overwrite data in output directory")

    parser.add_argument("--dont_normalize_importance_by_layer", action='store_true', help="Don't normalize importance score by layers")
    parser.add_argument("--dont_normalize_global_importance", action='store_true', help="Don't normalize all importance scores between 0 and 1")

    parser.add_argument("--try_masking", action='store_true', help="Whether to try to mask head until a threshold of accuracy.")
    parser.add_argument("--masking_threshold", default=0.9, type=float, help="masking threshold in term of metrics"
                                                                             "(stop masking when metric < threshold * original metric value).")
    parser.add_argument("--masking_amount", default=0.1, type=float, help="Amount to heads to masking at each masking step.")
    parser.add_argument("--metric_name", default="acc", type=str, help="Metric to use for head masking.")

    parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after WordPiece tokenization. \n"
                             "Sequences longer than this will be truncated, and sequences shorter \n"
                             "than this will be padded.")
    parser.add_argument("--batch_size", default=1, type=int, help="Batch size.")

    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--local_rank", type=int, default=-1, help="local_rank for distributed training on gpus")
    parser.add_argument("--no_cuda", action='store_true', help="Whether not to use CUDA when available")
    parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
    parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
    args = parser.parse_args()

    if args.server_ip and args.server_port:
        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
        import ptvsd
        print("Waiting for debugger attach")
        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
        ptvsd.wait_for_attach()

    # Setup devices and distributed training
    if args.local_rank == -1 or args.no_cuda:
        args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        torch.cuda.set_device(args.local_rank)
        args.device = torch.device("cuda", args.local_rank)
        n_gpu = 1
        torch.distributed.init_process_group(backend='nccl')  # Initializes the distributed backend

    # Setup logging
    logging.basicConfig(level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
    logger.info("device: {} n_gpu: {}, distributed: {}".format(args.device, n_gpu, bool(args.local_rank != -1)))

    # Set seeds
    np.random.seed(args.seed)
    torch.random.manual_seed(args.seed)
    if n_gpu > 0:
        torch.cuda.manual_seed(args.seed)

    # Prepare GLUE task
    task_name = args.task_name.lower()
    processor = processors[task_name]()
    label_list = processor.get_labels()
    args.output_mode = output_modes[task_name]
    args.num_labels = len(label_list)

    # Prepare output directory
    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and not args.overwrite_output_dir:
        raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
    if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
        os.makedirs(args.output_dir)

    # Load model & tokenizer
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only one distributed process download model & vocab
    tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)

    # Load a model with all BERTology options on:
    #   output_attentions => will output attention weights
    #   keep_multihead_output => will store gradient of attention head outputs for head importance computation
    #       see: http://arxiv.org/abs/1905.10650
    model = BertForSequenceClassification.from_pretrained(args.model_name_or_path,
                                                          num_labels=args.num_labels,
                                                          output_attentions=True,
                                                          keep_multihead_output=True)
    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only one distributed process download model & vocab
    model.to(args.device)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
    model.eval()

    # Prepare dataset for the GLUE task
    eval_examples = processor.get_dev_examples(args.data_dir)
    cached_eval_features_file = os.path.join(args.data_dir, 'dev_{0}_{1}_{2}'.format(
        list(filter(None, args.model_name_or_path.split('/'))).pop(), str(args.max_seq_length), str(task_name)))
    try:
        eval_features = torch.load(cached_eval_features_file)
    except:
        eval_features = convert_examples_to_features(eval_examples, label_list, args.max_seq_length, tokenizer, args.output_mode)
        if args.local_rank in [-1, 0]:
            logger.info("Saving eval features to cache file %s", cached_eval_features_file)
            torch.save(eval_features, cached_eval_features_file)

    all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long if args.output_mode == "classification" else torch.float)
    eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)

    if args.data_subset > 0:
        eval_data = Subset(eval_data, list(range(min(args.data_subset, len(eval_data)))))

    eval_sampler = SequentialSampler(eval_data) if args.local_rank == -1 else DistributedSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.batch_size)

    # Print/save training arguments
    print(args)
    torch.save(args, os.path.join(args.output_dir, 'run_args.bin'))

    # Compute head entropy and importance score
    attn_entropy, head_importance, _, _ = compute_heads_importance(args, model, eval_dataloader)

    # Print/save matrices
    np.save(os.path.join(args.output_dir, 'attn_entropy.npy'), attn_entropy.detach().cpu().numpy())
    np.save(os.path.join(args.output_dir, 'head_importance.npy'), head_importance.detach().cpu().numpy())

    logger.info("Attention entropies")
    print_2d_tensor(attn_entropy)
    logger.info("Head importance scores")
    print_2d_tensor(head_importance)
    logger.info("Head ranked by importance scores")
    head_ranks = torch.zeros(head_importance.numel(), dtype=torch.long, device=args.device)
    head_ranks[head_importance.view(-1).sort(descending=True)[1]] = torch.arange(head_importance.numel(), device=args.device)
    head_ranks = head_ranks.view_as(head_importance)
    print_2d_tensor(head_ranks)

    # Do masking if we want to
    if args.try_masking and args.masking_threshold > 0.0 and args.masking_threshold < 1.0:
        _, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False)
        preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
        original_score = compute_metrics(task_name, preds, labels)[args.metric_name]
        logger.info("Pruning: original score: %f, threshold: %f", original_score, original_score * args.masking_threshold)

        new_head_mask = torch.ones_like(head_importance)
        num_to_mask = max(1, int(new_head_mask.numel() * args.masking_amount))

        current_score = original_score
        while current_score >= original_score * args.masking_threshold:
            head_mask = new_head_mask.clone() # save current head mask
            # heads from least important to most - keep only not-masked heads
            head_importance[head_mask == 0.0] = float('Inf')
            current_heads_to_mask = head_importance.view(-1).sort()[1]

            if len(current_heads_to_mask) <= num_to_mask:
                break

            # mask heads
            current_heads_to_mask = current_heads_to_mask[:num_to_mask]
            logger.info("Heads to mask: %s", str(current_heads_to_mask.tolist()))
            new_head_mask = new_head_mask.view(-1)
            new_head_mask[current_heads_to_mask] = 0.0
            new_head_mask = new_head_mask.view_as(head_mask)
            print_2d_tensor(new_head_mask)

            # Compute metric and head importance again
            _, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False, head_mask=new_head_mask)
            preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
            current_score = compute_metrics(task_name, preds, labels)[args.metric_name]
            logger.info("Masking: current score: %f, remaning heads %d (%.1f percents)", current_score, new_head_mask.sum(), new_head_mask.sum()/new_head_mask.numel() * 100)

        logger.info("Final head mask")
        print_2d_tensor(head_mask)
        np.save(os.path.join(args.output_dir, 'head_mask.npy'), head_mask.detach().cpu().numpy())

        # Try pruning and test time speedup
        # Pruning is like masking but we actually remove the masked weights
        before_time = datetime.now()
        _, _, preds, labels = compute_heads_importance(args, model, eval_dataloader,
                                                       compute_entropy=False, compute_importance=False, head_mask=head_mask)
        preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
        score_masking = compute_metrics(task_name, preds, labels)[args.metric_name]
        original_time = datetime.now() - before_time

        original_num_params = sum(p.numel() for p in model.parameters())
        heads_to_prune = dict((layer, (1 - head_mask[layer].long()).nonzero().tolist()) for layer in range(len(head_mask)))
        assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item()
        model.bert.prune_heads(heads_to_prune)
        pruned_num_params = sum(p.numel() for p in model.parameters())

        before_time = datetime.now()
        _, _, preds, labels = compute_heads_importance(args, model, eval_dataloader,
                                                       compute_entropy=False, compute_importance=False, head_mask=None)
        preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
        score_pruning = compute_metrics(task_name, preds, labels)[args.metric_name]
        new_time = datetime.now() - before_time

        logger.info("Pruning: original num of params: %.2e, after pruning %.2e (%.1f percents)", original_num_params, pruned_num_params, pruned_num_params/original_num_params * 100)
        logger.info("Pruning: score with masking: %f score with pruning: %f", score_masking, score_pruning)
        logger.info("Pruning: speed ratio (new timing / original timing): %f percents", original_time/new_time * 100)

if __name__ == '__main__':
    run_model()


================================================
FILE: examples/extract_features.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Extract pre-computed feature vectors from a PyTorch BERT model."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import collections
import logging
import json
import re

import torch
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.modeling import BertModel

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s', 
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)


class InputExample(object):

    def __init__(self, unique_id, text_a, text_b):
        self.unique_id = unique_id
        self.text_a = text_a
        self.text_b = text_b


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, unique_id, tokens, input_ids, input_mask, input_type_ids):
        self.unique_id = unique_id
        self.tokens = tokens
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.input_type_ids = input_type_ids


def convert_examples_to_features(examples, seq_length, tokenizer):
    """Loads a data file into a list of `InputFeature`s."""

    features = []
    for (ex_index, example) in enumerate(examples):
        tokens_a = tokenizer.tokenize(example.text_a)

        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > seq_length - 2:
                tokens_a = tokens_a[0:(seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids:   0   0  0    0    0     0      0   0    1  1  1   1  1   1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids:   0   0   0   0  0     0   0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambigiously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        input_type_ids = []
        tokens.append("[CLS]")
        input_type_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            input_type_ids.append(0)
        tokens.append("[SEP]")
        input_type_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                input_type_ids.append(1)
            tokens.append("[SEP]")
            input_type_ids.append(1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        while len(input_ids) < seq_length:
            input_ids.append(0)
            input_mask.append(0)
            input_type_ids.append(0)

        assert len(input_ids) == seq_length
        assert len(input_mask) == seq_length
        assert len(input_type_ids) == seq_length

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("unique_id: %s" % (example.unique_id))
            logger.info("tokens: %s" % " ".join([str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            logger.info(
                "input_type_ids: %s" % " ".join([str(x) for x in input_type_ids]))

        features.append(
            InputFeatures(
                unique_id=example.unique_id,
                tokens=tokens,
                input_ids=input_ids,
                input_mask=input_mask,
                input_type_ids=input_type_ids))
    return features


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def read_examples(input_file):
    """Read a list of `InputExample`s from an input file."""
    examples = []
    unique_id = 0
    with open(input_file, "r", encoding='utf-8') as reader:
        while True:
            line = reader.readline()
            if not line:
                break
            line = line.strip()
            text_a = None
            text_b = None
            m = re.match(r"^(.*) \|\|\| (.*)$", line)
            if m is None:
                text_a = line
            else:
                text_a = m.group(1)
                text_b = m.group(2)
            examples.append(
                InputExample(unique_id=unique_id, text_a=text_a, text_b=text_b))
            unique_id += 1
    return examples


def main():
    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument("--input_file", default=None, type=str, required=True)
    parser.add_argument("--output_file", default=None, type=str, required=True)
    parser.add_argument("--bert_model", default=None, type=str, required=True,
                        help="Bert pre-trained model selected in the list: bert-base-uncased, "
                             "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")

    ## Other parameters
    parser.add_argument("--do_lower_case", action='store_true', help="Set this flag if you are using an uncased model.")
    parser.add_argument("--layers", default="-1,-2,-3,-4", type=str)
    parser.add_argument("--max_seq_length", default=128, type=int,
                        help="The maximum total input sequence length after WordPiece tokenization. Sequences longer "
                            "than this will be truncated, and sequences shorter than this will be padded.")
    parser.add_argument("--batch_size", default=32, type=int, help="Batch size for predictions.")
    parser.add_argument("--local_rank",
                        type=int,
                        default=-1,
                        help = "local_rank for distributed training on gpus")
    parser.add_argument("--no_cuda",
                        action='store_true',
                        help="Whether not to use CUDA when available")

    args = parser.parse_args()

    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        device = torch.device("cuda", args.local_rank)
        n_gpu = 1
        # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
        torch.distributed.init_process_group(backend='nccl')
    logger.info("device: {} n_gpu: {} distributed training: {}".format(device, n_gpu, bool(args.local_rank != -1)))

    layer_indexes = [int(x) for x in args.layers.split(",")]

    tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

    examples = read_examples(args.input_file)

    features = convert_examples_to_features(
        examples=examples, seq_length=args.max_seq_length, tokenizer=tokenizer)

    unique_id_to_feature = {}
    for feature in features:
        unique_id_to_feature[feature.unique_id] = feature

    model = BertModel.from_pretrained(args.bert_model)
    model.to(device)

    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                          output_device=args.local_rank)
    elif n_gpu > 1:
        model = torch.nn.DataParallel(model)

    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
    all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)

    eval_data = TensorDataset(all_input_ids, all_input_mask, all_example_index)
    if args.local_rank == -1:
        eval_sampler = SequentialSampler(eval_data)
    else:
        eval_sampler = DistributedSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.batch_size)

    model.eval()
    with open(args.output_file, "w", encoding='utf-8') as writer:
        for input_ids, input_mask, example_indices in eval_dataloader:
            input_ids = input_ids.to(device)
            input_mask = input_mask.to(device)

            all_encoder_layers, _ = model(input_ids, token_type_ids=None, attention_mask=input_mask)
            all_encoder_layers = all_encoder_layers

            for b, example_index in enumerate(example_indices):
                feature = features[example_index.item()]
                unique_id = int(feature.unique_id)
                # feature = unique_id_to_feature[unique_id]
                output_json = collections.OrderedDict()
                output_json["linex_index"] = unique_id
                all_out_features = []
                for (i, token) in enumerate(feature.tokens):
                    all_layers = []
                    for (j, layer_index) in enumerate(layer_indexes):
                        layer_output = all_encoder_layers[int(layer_index)].detach().cpu().numpy()
                        layer_output = layer_output[b]
                        layers = collections.OrderedDict()
                        layers["index"] = layer_index
                        layers["values"] = [
                            round(x.item(), 6) for x in layer_output[i]
                        ]
                        all_layers.append(layers)
                    out_features = collections.OrderedDict()
                    out_features["token"] = token
                    out_features["layers"] = all_layers
                    all_out_features.append(out_features)
                output_json["features"] = all_out_features
                writer.write(json.dumps(output_json) + "\n")


if __name__ == "__main__":
    main()


================================================
FILE: examples/lm_finetuning/README.md
================================================
# BERT Model Finetuning using Masked Language Modeling objective

## Introduction

The three example scripts in this folder can be used to **fine-tune** a pre-trained BERT model using the pretraining objective (combination of masked language modeling and next sentence prediction loss). In general, pretrained models like BERT are first trained with a pretraining objective (masked language modeling and next sentence prediction for BERT) on a large and general natural language corpus. A classifier head is then added on top of the pre-trained architecture and the model is quickly fine-tuned on a target task, while still (hopefully) retaining its general language understanding. This greatly reduces overfitting and yields state-of-the-art results, especially when training data for the target task are limited.

The [ULMFiT paper](https://arxiv.org/abs/1801.06146) took a slightly different approach, however, and added an intermediate step in which the model is fine-tuned on text **from the same domain as the target task and using the pretraining objective** before the final stage in which the classifier head is added and the model is trained on the target task itself. This paper reported significantly improved results from this step, and found that they could get high-quality classifications even with only tiny numbers (<1000) of labelled training examples, as long as they had a lot of unlabelled data from the target domain.

Although this wasn't covered in the original BERT paper, domain-specific fine-tuning of Transformer models has [recently been reported by other authors](https://arxiv.org/pdf/1905.05583.pdf), and they report performance improvements as well.

## Input format

The scripts in this folder expect a single file as input, consisting of untokenized text, with one **sentence** per line, and one blank line between documents. The reason for the sentence splitting is that part of BERT's training involves a _next sentence_ objective in which the model must predict whether two sequences of text are contiguous text from the same document or not, and to avoid making the task _too easy_, the split point between the sequences is always at the end of a sentence. The linebreaks in the file are therefore necessary to mark the points where the text can be split.

## Usage

There are two ways to fine-tune a language model using these scripts. The first _quick_ approach is to use [`simple_lm_finetuning.py`](./simple_lm_finetuning.py). This script does everything in a single script, but generates training instances that consist of just two sentences. This is quite different from the BERT paper, where (confusingly) the NextSentence task concatenated sentences together from each document to form two long multi-sentences, which the paper just referred to as _sentences_. The difference between this simple approach and the original paper approach can have a significant effect for long sequences since two sentences will be much shorter than the max sequence length. In this case, most of each training example will just consist of blank padding characters, which wastes a lot of computation and results in a model that isn't really training on long sequences.

As such, the preferred approach (assuming you have documents containing multiple contiguous sentences from your target domain) is to use [`pregenerate_training_data.py`](./pregenerate_training_data.py) to pre-process your data into training examples following the methodology used for LM training in the original BERT paper and repository. Since there is a significant random component to training data generation for BERT, this script includes an option to generate multiple _epochs_ of pre-processed data, to avoid training on the same random splits each epoch. Generating an epoch of data for each training epoch should result a better final model, and so we recommend doing so.

You can then train on the pregenerated data using [`finetune_on_pregenerated.py`](./finetune_on_pregenerated.py), and pointing it to the folder created by [`pregenerate_training_data.py`](./pregenerate_training_data.py). Note that you should use the same `bert_model` and case options for both! Also note that `max_seq_len` does not need to be specified for the [`finetune_on_pregenerated.py`](./finetune_on_pregenerated.py) script, as it is inferred from the training examples.

There are various options that can be tweaked, but they are mostly set to the values from the BERT paper/repository and default values should make sense. The most relevant ones are:

- `--max_seq_len`: Controls the length of training examples (in wordpiece tokens) seen by the model. Defaults to 128 but can be set as high as 512. Higher values may yield stronger language models at the cost of slower and more memory-intensive training.
- `--fp16`: Enables fast half-precision training on recent GPUs.

In addition, if memory usage is an issue, especially when training on a single GPU, reducing `--train_batch_size` from the default 32 to a lower number (4-16) can be helpful, or leaving `--train_batch_size` at the default and increasing `--gradient_accumulation_steps` to 2-8. Changing `--gradient_accumulation_steps` may be preferable as alterations to the batch size may require corresponding changes in the learning rate to compensate. There is also a `--reduce_memory` option for both the `pregenerate_training_data.py` and `finetune_on_pregenerated.py` scripts that spills data to disc in shelf objects or numpy memmaps rather than retaining it in memory, which significantly reduces memory usage with little performance impact.

## Examples

### Simple fine-tuning

```
python3 simple_lm_finetuning.py 
--train_corpus my_corpus.txt 
--bert_model bert-base-uncased 
--do_lower_case 
--output_dir finetuned_lm/
--do_train
```

### Pregenerating training data

```
python3 pregenerate_training_data.py
--train_corpus my_corpus.txt
--bert_model bert-base-uncased
--do_lower_case
--output_dir training/
--epochs_to_generate 3
--max_seq_len 256
```

### Training on pregenerated data

```
python3 finetune_on_pregenerated.py
--pregenerated_data training/
--bert_model bert-base-uncased
--do_lower_case
--output_dir finetuned_lm/
--epochs 3
```


================================================
FILE: examples/lm_finetuning/finetune_on_pregenerated.py
================================================
from argparse import ArgumentParser
from pathlib import Path
import os
import torch
import logging
import json
import random
import numpy as np
from collections import namedtuple
from tempfile import TemporaryDirectory

from torch.utils.data import DataLoader, Dataset, RandomSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm

from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
from pytorch_pretrained_bert.modeling import BertForPreTraining
from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule

InputFeatures = namedtuple("InputFeatures", "input_ids input_mask segment_ids lm_label_ids is_next")

log_format = '%(asctime)-10s: %(message)s'
logging.basicConfig(level=logging.INFO, format=log_format)


def convert_example_to_features(example, tokenizer, max_seq_length):
    tokens = example["tokens"]
    segment_ids = example["segment_ids"]
    is_random_next = example["is_random_next"]
    masked_lm_positions = example["masked_lm_positions"]
    masked_lm_labels = example["masked_lm_labels"]

    assert len(tokens) == len(segment_ids) <= max_seq_length  # The preprocessed data should be already truncated
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    masked_label_ids = tokenizer.convert_tokens_to_ids(masked_lm_labels)

    input_array = np.zeros(max_seq_length, dtype=np.int)
    input_array[:len(input_ids)] = input_ids

    mask_array = np.zeros(max_seq_length, dtype=np.bool)
    mask_array[:len(input_ids)] = 1

    segment_array = np.zeros(max_seq_length, dtype=np.bool)
    segment_array[:len(segment_ids)] = segment_ids

    lm_label_array = np.full(max_seq_length, dtype=np.int, fill_value=-1)
    lm_label_array[masked_lm_positions] = masked_label_ids

    features = InputFeatures(input_ids=input_array,
                             input_mask=mask_array,
                             segment_ids=segment_array,
                             lm_label_ids=lm_label_array,
                             is_next=is_random_next)
    return features


class PregeneratedDataset(Dataset):
    def __init__(self, training_path, epoch, tokenizer, num_data_epochs, reduce_memory=False):
        self.vocab = tokenizer.vocab
        self.tokenizer = tokenizer
        self.epoch = epoch
        self.data_epoch = epoch % num_data_epochs
        data_file = training_path / f"epoch_{self.data_epoch}.json"
        metrics_file = training_path / f"epoch_{self.data_epoch}_metrics.json"
        assert data_file.is_file() and metrics_file.is_file()
        metrics = json.loads(metrics_file.read_text())
        num_samples = metrics['num_training_examples']
        seq_len = metrics['max_seq_len']
        self.temp_dir = None
        self.working_dir = None
        if reduce_memory:
            self.temp_dir = TemporaryDirectory()
            self.working_dir = Path(self.temp_dir.name)
            input_ids = np.memmap(filename=self.working_dir/'input_ids.memmap',
                                  mode='w+', dtype=np.int32, shape=(num_samples, seq_len))
            input_masks = np.memmap(filename=self.working_dir/'input_masks.memmap',
                                    shape=(num_samples, seq_len), mode='w+', dtype=np.bool)
            segment_ids = np.memmap(filename=self.working_dir/'segment_ids.memmap',
                                    shape=(num_samples, seq_len), mode='w+', dtype=np.bool)
            lm_label_ids = np.memmap(filename=self.working_dir/'lm_label_ids.memmap',
                                     shape=(num_samples, seq_len), mode='w+', dtype=np.int32)
            lm_label_ids[:] = -1
            is_nexts = np.memmap(filename=self.working_dir/'is_nexts.memmap',
                                 shape=(num_samples,), mode='w+', dtype=np.bool)
        else:
            input_ids = np.zeros(shape=(num_samples, seq_len), dtype=np.int32)
            input_masks = np.zeros(shape=(num_samples, seq_len), dtype=np.bool)
            segment_ids = np.zeros(shape=(num_samples, seq_len), dtype=np.bool)
            lm_label_ids = np.full(shape=(num_samples, seq_len), dtype=np.int32, fill_value=-1)
            is_nexts = np.zeros(shape=(num_samples,), dtype=np.bool)
        logging.info(f"Loading training examples for epoch {epoch}")
        with data_file.open() as f:
            for i, line in enumerate(tqdm(f, total=num_samples, desc="Training examples")):
                line = line.strip()
                example = json.loads(line)
                features = convert_example_to_features(example, tokenizer, seq_len)
                input_ids[i] = features.input_ids
                segment_ids[i] = features.segment_ids
                input_masks[i] = features.input_mask
                lm_label_ids[i] = features.lm_label_ids
                is_nexts[i] = features.is_next
        assert i == num_samples - 1  # Assert that the sample count metric was true
        logging.info("Loading complete!")
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.input_ids = input_ids
        self.input_masks = input_masks
        self.segment_ids = segment_ids
        self.lm_label_ids = lm_label_ids
        self.is_nexts = is_nexts

    def __len__(self):
        return self.num_samples

    def __getitem__(self, item):
        return (torch.tensor(self.input_ids[item].astype(np.int64)),
                torch.tensor(self.input_masks[item].astype(np.int64)),
                torch.tensor(self.segment_ids[item].astype(np.int64)),
                torch.tensor(self.lm_label_ids[item].astype(np.int64)),
                torch.tensor(self.is_nexts[item].astype(np.int64)))


def main():
    parser = ArgumentParser()
    parser.add_argument('--pregenerated_data', type=Path, required=True)
    parser.add_argument('--output_dir', type=Path, required=True)
    parser.add_argument("--bert_model", type=str, required=True, help="Bert pre-trained model selected in the list: bert-base-uncased, "
                             "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")
    parser.add_argument("--do_lower_case", action="store_true")
    parser.add_argument("--reduce_memory", action="store_true",
                        help="Store training data as on-disc memmaps to massively reduce memory usage")

    parser.add_argument("--epochs", type=int, default=3, help="Number of epochs to train for")
    parser.add_argument("--local_rank",
                        type=int,
                        default=-1,
                        help="local_rank for distributed training on gpus")
    parser.add_argument("--no_cuda",
                        action='store_true',
                        help="Whether not to use CUDA when available")
    parser.add_argument('--gradient_accumulation_steps',
                        type=int,
                        default=1,
                        help="Number of updates steps to accumulate before performing a backward/update pass.")
    parser.add_argument("--train_batch_size",
                        default=32,
                        type=int,
                        help="Total batch size for training.")
    parser.add_argument('--fp16',
                        action='store_true',
                        help="Whether to use 16-bit float precision instead of 32-bit")
    parser.add_argument('--loss_scale',
                        type=float, default=0,
                        help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
                        "0 (default value): dynamic loss scaling.\n"
                        "Positive power of 2: static loss scaling value.\n")
    parser.add_argument("--warmup_proportion",
                        default=0.1,
                        type=float,
                        help="Proportion of training to perform linear learning rate warmup for. "
                             "E.g., 0.1 = 10%% of training.")
    parser.add_argument("--learning_rate",
                        default=3e-5,
                        type=float,
                        help="The initial learning rate for Adam.")
    parser.add_argument('--seed',
                        type=int,
                        default=42,
                        help="random seed for initialization")
    args = parser.parse_args()

    assert args.pregenerated_data.is_dir(), \
        "--pregenerated_data should point to the folder of files made by pregenerate_training_data.py!"

    samples_per_epoch = []
    for i in range(args.epochs):
        epoch_file = args.pregenerated_data / f"epoch_{i}.json"
        metrics_file = args.pregenerated_data / f"epoch_{i}_metrics.json"
        if epoch_file.is_file() and metrics_file.is_file():
            metrics = json.loads(metrics_file.read_text())
            samples_per_epoch.append(metrics['num_training_examples'])
        else:
            if i == 0:
                exit("No training data was found!")
            print(f"Warning! There are fewer epochs of pregenerated data ({i}) than training epochs ({args.epochs}).")
            print("This script will loop over the available data, but training diversity may be negatively impacted.")
            num_data_epochs = i
            break
    else:
        num_data_epochs = args.epochs

    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)
        n_gpu = 1
        # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
        torch.distributed.init_process_group(backend='nccl')
    logging.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
        device, n_gpu, bool(args.local_rank != -1), args.fp16))

    if args.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
                            args.gradient_accumulation_steps))

    args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

    if args.output_dir.is_dir() and list(args.output_dir.iterdir()):
        logging.warning(f"Output directory ({args.output_dir}) already exists and is not empty!")
    args.output_dir.mkdir(parents=True, exist_ok=True)

    tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

    total_train_examples = 0
    for i in range(args.epochs):
        # The modulo takes into account the fact that we may loop over limited epochs of data
        total_train_examples += samples_per_epoch[i % len(samples_per_epoch)]

    num_train_optimization_steps = int(
        total_train_examples / args.train_batch_size / args.gradient_accumulation_steps)
    if args.local_rank != -1:
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

    # Prepare model
    model = BertForPreTraining.from_pretrained(args.bert_model)
    if args.fp16:
        model.half()
    model.to(device)
    if args.local_rank != -1:
        try:
            from apex.parallel import DistributedDataParallel as DDP
        except ImportError:
            raise ImportError(
                "Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
        model = DDP(model)
    elif n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Prepare optimizer
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

    if args.fp16:
        try:
            from apex.optimizers import FP16_Optimizer
            from apex.optimizers import FusedAdam
        except ImportError:
            raise ImportError(
                "Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")

        optimizer = FusedAdam(optimizer_grouped_parameters,
                              lr=args.learning_rate,
                              bias_correction=False,
                              max_grad_norm=1.0)
        if args.loss_scale == 0:
            optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
        else:
            optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
        warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion,
                                             t_total=num_train_optimization_steps)
    else:
        optimizer = BertAdam(optimizer_grouped_parameters,
                             lr=args.learning_rate,
                             warmup=args.warmup_proportion,
                             t_total=num_train_optimization_steps)

    global_step = 0
    logging.info("***** Running training *****")
    logging.info(f"  Num examples = {total_train_examples}")
    logging.info("  Batch size = %d", args.train_batch_size)
    logging.info("  Num steps = %d", num_train_optimization_steps)
    model.train()
    for epoch in range(args.epochs):
        epoch_dataset = PregeneratedDataset(epoch=epoch, training_path=args.pregenerated_data, tokenizer=tokenizer,
                                            num_data_epochs=num_data_epochs, reduce_memory=args.reduce_memory)
        if args.local_rank == -1:
            train_sampler = RandomSampler(epoch_dataset)
        else:
            train_sampler = DistributedSampler(epoch_dataset)
        train_dataloader = DataLoader(epoch_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
        tr_loss = 0
        nb_tr_examples, nb_tr_steps = 0, 0
        with tqdm(total=len(train_dataloader), desc=f"Epoch {epoch}") as pbar:
            for step, batch in enumerate(train_dataloader):
                batch = tuple(t.to(device) for t in batch)
                input_ids, input_mask, segment_ids, lm_label_ids, is_next = batch
                loss = model(input_ids, segment_ids, input_mask, lm_label_ids, is_next)
                if n_gpu > 1:
                    loss = loss.mean() # mean() to average on multi-gpu.
                if args.gradient_accumulation_steps > 1:
                    loss = loss / args.gradient_accumulation_steps
                if args.fp16:
                    optimizer.backward(loss)
                else:
                    loss.backward()
                tr_loss += loss.item()
                nb_tr_examples += input_ids.size(0)
                nb_tr_steps += 1
                pbar.update(1)
                mean_loss = tr_loss * args.gradient_accumulation_steps / nb_tr_steps
                pbar.set_postfix_str(f"Loss: {mean_loss:.5f}")
                if (step + 1) % args.gradient_accumulation_steps == 0:
                    if args.fp16:
                        # modify learning rate with special warm up BERT uses
                        # if args.fp16 is False, BertAdam is used that handles this automatically
                        lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
                        for param_group in optimizer.param_groups:
                            param_group['lr'] = lr_this_step
                    optimizer.step()
                    optimizer.zero_grad()
                    global_step += 1

    # Save a trained model
    logging.info("** ** * Saving fine-tuned model ** ** * ")
    model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self
    
    output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
    output_config_file = os.path.join(args.output_dir, CONFIG_NAME)

    torch.save(model_to_save.state_dict(), output_model_file)
    model_to_save.config.to_json_file(output_config_file)
    tokenizer.save_vocabulary(args.output_dir)


if __name__ == '__main__':
    main()


================================================
FILE: examples/lm_finetuning/pregenerate_training_data.py
================================================
from argparse import ArgumentParser
from pathlib import Path
from tqdm import tqdm, trange
from tempfile import TemporaryDirectory
import shelve
from multiprocessing import Pool

from random import random, randrange, randint, shuffle, choice
from pytorch_pretrained_bert.tokenization import BertTokenizer
import numpy as np
import json
import collections

class DocumentDatabase:
    def __init__(self, reduce_memory=False):
        if reduce_memory:
            self.temp_dir = TemporaryDirectory()
            self.working_dir = Path(self.temp_dir.name)
            self.document_shelf_filepath = self.working_dir / 'shelf.db'
            self.document_shelf = shelve.open(str(self.document_shelf_filepath),
                                              flag='n', protocol=-1)
            self.documents = None
        else:
            self.documents = []
            self.document_shelf = None
            self.document_shelf_filepath = None
            self.temp_dir = None
        self.doc_lengths = []
        self.doc_cumsum = None
        self.cumsum_max = None
        self.reduce_memory = reduce_memory

    def add_document(self, document):
        if not document:
            return
        if self.reduce_memory:
            current_idx = len(self.doc_lengths)
            self.document_shelf[str(current_idx)] = document
        else:
            self.documents.append(document)
        self.doc_lengths.append(len(document))

    def _precalculate_doc_weights(self):
        self.doc_cumsum = np.cumsum(self.doc_lengths)
        self.cumsum_max = self.doc_cumsum[-1]

    def sample_doc(self, current_idx, sentence_weighted=True):
        # Uses the current iteration counter to ensure we don't sample the same doc twice
        if sentence_weighted:
            # With sentence weighting, we sample docs proportionally to their sentence length
            if self.doc_cumsum is None or len(self.doc_cumsum) != len(self.doc_lengths):
                self._precalculate_doc_weights()
            rand_start = self.doc_cumsum[current_idx]
            rand_end = rand_start + self.cumsum_max - self.doc_lengths[current_idx]
            sentence_index = randrange(rand_start, rand_end) % self.cumsum_max
            sampled_doc_index = np.searchsorted(self.doc_cumsum, sentence_index, side='right')
        else:
            # If we don't use sentence weighting, then every doc has an equal chance to be chosen
            sampled_doc_index = (current_idx + randrange(1, len(self.doc_lengths))) % len(self.doc_lengths)
        assert sampled_doc_index != current_idx
        if self.reduce_memory:
            return self.document_shelf[str(sampled_doc_index)]
        else:
            return self.documents[sampled_doc_index]

    def __len__(self):
        return len(self.doc_lengths)

    def __getitem__(self, item):
        if self.reduce_memory:
            return self.document_shelf[str(item)]
        else:
            return self.documents[item]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, traceback):
        if self.document_shelf is not None:
            self.document_shelf.close()
        if self.temp_dir is not None:
            self.temp_dir.cleanup()


def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens):
    """Truncates a pair of sequences to a maximum sequence length. Lifted from Google's BERT repo."""
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_num_tokens:
            break

        trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        assert len(trunc_tokens) >= 1

        # We want to sometimes truncate from the front and sometimes from the
        # back to add more randomness and avoid biases.
        if random() < 0.5:
            del trunc_tokens[0]
        else:
            trunc_tokens.pop()

MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
                                          ["index", "label"])

def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, whole_word_mask, vocab_list):
    """Creates the predictions for the masked LM objective. This is mostly copied from the Google BERT repo, but
    with several refactors to clean it up and remove a lot of unnecessary variables."""
    cand_indices = []
    for (i, token) in enumerate(tokens):
        if token == "[CLS]" or token == "[SEP]":
            continue
        # Whole Word Masking means that if we mask all of the wordpieces
        # corresponding to an original word. When a word has been split into
        # WordPieces, the first token does not have any marker and any subsequence
        # tokens are prefixed with ##. So whenever we see the ## token, we
        # append it to the previous set of word indexes.
        #
        # Note that Whole Word Masking does *not* change the training code
        # at all -- we still predict each WordPiece independently, softmaxed
        # over the entire vocabulary.
        if (whole_word_mask and len(cand_indices) >= 1 and token.startswith("##")):
            cand_indices[-1].append(i)
        else:
            cand_indices.append([i])

    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(round(len(tokens) * masked_lm_prob))))
    shuffle(cand_indices)
    masked_lms = []
    covered_indexes = set()
    for index_set in cand_indices:
        if len(masked_lms) >= num_to_mask:
            break
        # If adding a whole-word mask would exceed the maximum number of
        # predictions, then just skip this candidate.
        if len(masked_lms) + len(index_set) > num_to_mask:
            continue
        is_any_index_covered = False
        for index in index_set:
            if index in covered_indexes:
                is_any_index_covered = True
                break
        if is_any_index_covered:
            continue
        for index in index_set:
            covered_indexes.add(index)

            masked_token = None
            # 80% of the time, replace with [MASK]
            if random() < 0.8:
                masked_token = "[MASK]"
            else:
                # 10% of the time, keep original
                if random() < 0.5:
                    masked_token = tokens[index]
                # 10% of the time, replace with random word
                else:
                    masked_token = choice(vocab_list)
            masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
            tokens[index] = masked_token

    assert len(masked_lms) <= num_to_mask
    masked_lms = sorted(masked_lms, key=lambda x: x.index)
    mask_indices = [p.index for p in masked_lms]
    masked_token_labels = [p.label for p in masked_lms]

    return tokens, mask_indices, masked_token_labels


def create_instances_from_document(
        doc_database, doc_idx, max_seq_length, short_seq_prob,
        masked_lm_prob, max_predictions_per_seq, whole_word_mask, vocab_list):
    """This code is mostly a duplicate of the equivalent function from Google BERT's repo.
    However, we make some changes and improvements. Sampling is improved and no longer requires a loop in this function.
    Also, documents are sampled proportionally to the number of sentences they contain, which means each sentence
    (rather than each document) has an equal chance of being sampled as a false example for the NextSentence task."""
    document = doc_database[doc_idx]
    # Account for [CLS], [SEP], [SEP]
    max_num_tokens = max_seq_length - 3

    # We *usually* want to fill up the entire sequence since we are padding
    # to `max_seq_length` anyways, so short sequences are generally wasted
    # computation. However, we *sometimes*
    # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
    # sequences to minimize the mismatch between pre-training and fine-tuning.
    # The `target_seq_length` is just a rough target however, whereas
    # `max_seq_length` is a hard limit.
    target_seq_length = max_num_tokens
    if random() < short_seq_prob:
        target_seq_length = randint(2, max_num_tokens)

    # We DON'T just concatenate all of the tokens from a document into a long
    # sequence and choose an arbitrary split point because this would make the
    # next sentence prediction task too easy. Instead, we split the input into
    # segments "A" and "B" based on the actual "sentences" provided by the user
    # input.
    instances = []
    current_chunk = []
    current_length = 0
    i = 0
    while i < len(document):
        segment = document[i]
        current_chunk.append(segment)
        current_length += len(segment)
        if i == len(document) - 1 or current_length >= target_seq_length:
            if current_chunk:
                # `a_end` is how many segments from `current_chunk` go into the `A`
                # (first) sentence.
                a_end = 1
                if len(current_chunk) >= 2:
                    a_end = randrange(1, len(current_chunk))

                tokens_a = []
                for j in range(a_end):
                    tokens_a.extend(current_chunk[j])

                tokens_b = []

                # Random next
                if len(current_chunk) == 1 or random() < 0.5:
                    is_random_next = True
                    target_b_length = target_seq_length - len(tokens_a)

                    # Sample a random document, with longer docs being sampled more frequently
                    random_document = doc_database.sample_doc(current_idx=doc_idx, sentence_weighted=True)

                    random_start = randrange(0, len(random_document))
                    for j in range(random_start, len(random_document)):
                        tokens_b.extend(random_document[j])
                        if len(tokens_b) >= target_b_length:
                            break
                    # We didn't actually use these segments so we "put them back" so
                    # they don't go to waste.
                    num_unused_segments = len(current_chunk) - a_end
                    i -= num_unused_segments
                # Actual next
                else:
                    is_random_next = False
                    for j in range(a_end, len(current_chunk)):
                        tokens_b.extend(current_chunk[j])
                truncate_seq_pair(tokens_a, tokens_b, max_num_tokens)

                assert len(tokens_a) >= 1
                assert len(tokens_b) >= 1

                tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
                # The segment IDs are 0 for the [CLS] token, the A tokens and the first [SEP]
                # They are 1 for the B tokens and the final [SEP]
                segment_ids = [0 for _ in range(len(tokens_a) + 2)] + [1 for _ in range(len(tokens_b) + 1)]

                tokens, masked_lm_positions, masked_lm_labels = create_masked_lm_predictions(
                    tokens, masked_lm_prob, max_predictions_per_seq, whole_word_mask, vocab_list)

                instance = {
                    "tokens": tokens,
                    "segment_ids": segment_ids,
                    "is_random_next": is_random_next,
                    "masked_lm_positions": masked_lm_positions,
                    "masked_lm_labels": masked_lm_labels}
                instances.append(instance)
            current_chunk = []
            current_length = 0
        i += 1

    return instances


def create_training_file(docs, vocab_list, args, epoch_num):
    epoch_filename = args.output_dir / "epoch_{}.json".format(epoch_num)
    num_instances = 0
    with epoch_filename.open('w') as epoch_file:
        for doc_idx in trange(len(docs), desc="Document"):
            doc_instances = create_instances_from_document(
                docs, doc_idx, max_seq_length=args.max_seq_len, short_seq_prob=args.short_seq_prob,
                masked_lm_prob=args.masked_lm_prob, max_predictions_per_seq=args.max_predictions_per_seq,
                whole_word_mask=args.do_whole_word_mask, vocab_list=vocab_list)
            doc_instances = [json.dumps(instance) for instance in doc_instances]
            for instance in doc_instances:
                epoch_file.write(instance + '\n')
                num_instances += 1
    metrics_file = args.output_dir / "epoch_{}_metrics.json".format(epoch_num)
    with metrics_file.open('w') as metrics_file:
        metrics = {
            "num_training_examples": num_instances,
            "max_seq_len": args.max_seq_len
        }
        metrics_file.write(json.dumps(metrics))


def main():
    parser = ArgumentParser()
    parser.add_argument('--train_corpus', type=Path, required=True)
    parser.add_argument("--output_dir", type=Path, required=True)
    parser.add_argument("--bert_model", type=str, required=True,
                        choices=["bert-base-uncased", "bert-large-uncased", "bert-base-cased",
                                 "bert-base-multilingual-uncased", "bert-base-chinese", "bert-base-multilingual-cased"])
    parser.add_argument("--do_lower_case", action="store_true")
    parser.add_argument("--do_whole_word_mask", action="store_true",
                        help="Whether to use whole word masking rather than per-WordPiece masking.")
    parser.add_argument("--reduce_memory", action="store_true",
                        help="Reduce memory usage for large datasets by keeping data on disc rather than in memory")

    parser.add_argument("--num_workers", type=int, default=1,
                        help="The number of workers to use to write the files")
    parser.add_argument("--epochs_to_generate", type=int, default=3,
                        help="Number of epochs of data to pregenerate")
    parser.add_argument("--max_seq_len", type=int, default=128)
    parser.add_argument("--short_seq_prob", type=float, default=0.1,
                        help="Probability of making a short sentence as a training example")
    parser.add_argument("--masked_lm_prob", type=float, default=0.15,
                        help="Probability of masking each token for the LM task")
    parser.add_argument("--max_predictions_per_seq", type=int, default=20,
                        help="Maximum number of tokens to mask in each sequence")

    args = parser.parse_args()

    if args.num_workers > 1 and args.reduce_memory:
        raise ValueError("Cannot use multiple workers while reducing memory")

    tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
    vocab_list = list(tokenizer.vocab.keys())
    with DocumentDatabase(reduce_memory=args.reduce_memory) as docs:
        with args.train_corpus.open() as f:
            doc = []
            for line in tqdm(f, desc="Loading Dataset", unit=" lines"):
                line = line.strip()
                if line == "":
                    docs.add_document(doc)
                    doc = []
                else:
                    tokens = tokenizer.tokenize(line)
                    doc.append(tokens)
            if doc:
                docs.add_document(doc)  # If the last doc didn't end on a newline, make sure it still gets added
        if len(docs) <= 1:
            exit("ERROR: No document breaks were found in the input file! These are necessary to allow the script to "
                 "ensure that random NextSentences are not sampled from the same document. Please add blank lines to "
                 "indicate breaks between documents in your input file. If your dataset does not contain multiple "
                 "documents, blank lines can be inserted at any natural boundary, such as the ends of chapters, "
                 "sections or paragraphs.")

        args.output_dir.mkdir(exist_ok=True)

        if args.num_workers > 1:
            writer_workers = Pool(min(args.num_workers, args.epochs_to_generate))
            arguments = [(docs, vocab_list, args, idx) for idx in range(args.epochs_to_generate)]
            writer_workers.starmap(create_training_file, arguments)
        else:
            for epoch in trange(args.epochs_to_generate, desc="Epoch"):
                create_training_file(docs, vocab_list, args, epoch)


if __name__ == '__main__':
    main()


================================================
FILE: examples/lm_finetuning/simple_lm_finetuning.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""

from __future__ import absolute_import, division, print_function, unicode_literals

import argparse
import logging
import os
import random
from io import open

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange

from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
from pytorch_pretrained_bert.modeling import BertForPreTraining
from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)


class BERTDataset(Dataset):
    def __init__(self, corpus_path, tokenizer, seq_len, encoding="utf-8", corpus_lines=None, on_memory=True):
        self.vocab = tokenizer.vocab
        self.tokenizer = tokenizer
        self.seq_len = seq_len
        self.on_memory = on_memory
        self.corpus_lines = corpus_lines  # number of non-empty lines in input corpus
        self.corpus_path = corpus_path
        self.encoding = encoding
        self.current_doc = 0  # to avoid random sentence from same doc

        # for loading samples directly from file
        self.sample_counter = 0  # used to keep track of full epochs on file
        self.line_buffer = None  # keep second sentence of a pair in memory and use as first sentence in next pair

        # for loading samples in memory
        self.current_random_doc = 0
        self.num_docs = 0
        self.sample_to_doc = [] # map sample index to doc and line

        # load samples into memory
        if on_memory:
            self.all_docs = []
            doc = []
            self.corpus_lines = 0
            with open(corpus_path, "r", encoding=encoding) as f:
                for line in tqdm(f, desc="Loading Dataset", total=corpus_lines):
                    line = line.strip()
                    if line == "":
                        self.all_docs.append(doc)
                        doc = []
                        #remove last added sample because there won't be a subsequent line anymore in the doc
                        self.sample_to_doc.pop()
                    else:
                        #store as one sample
                        sample = {"doc_id": len(self.all_docs),
                                  "line": len(doc)}
                        self.sample_to_doc.append(sample)
                        doc.append(line)
                        self.corpus_lines = self.corpus_lines + 1

            # if last row in file is not empty
            if self.all_docs[-1] != doc:
                self.all_docs.append(doc)
                self.sample_to_doc.pop()

            self.num_docs = len(self.all_docs)

        # load samples later lazily from disk
        else:
            if self.corpus_lines is None:
                with open(corpus_path, "r", encoding=encoding) as f:
                    self.corpus_lines = 0
                    for line in tqdm(f, desc="Loading Dataset", total=corpus_lines):
                        if line.strip() == "":
                            self.num_docs += 1
                        else:
                            self.corpus_lines += 1

                    # if doc does not end with empty line
                    if line.strip() != "":
                        self.num_docs += 1

            self.file = open(corpus_path, "r", encoding=encoding)
            self.random_file = open(corpus_path, "r", encoding=encoding)

    def __len__(self):
        # last line of doc won't be used, because there's no "nextSentence". Additionally, we start counting at 0.
        return self.corpus_lines - self.num_docs - 1

    def __getitem__(self, item):
        cur_id = self.sample_counter
        self.sample_counter += 1
        if not self.on_memory:
            # after one epoch we start again from beginning of file
            if cur_id != 0 and (cur_id % len(self) == 0):
                self.file.close()
                self.file = open(self.corpus_path, "r", encoding=self.encoding)

        t1, t2, is_next_label = self.random_sent(item)

        # tokenize
        tokens_a = self.tokenizer.tokenize(t1)
        tokens_b = self.tokenizer.tokenize(t2)

        # combine to one sample
        cur_example = InputExample(guid=cur_id, tokens_a=tokens_a, tokens_b=tokens_b, is_next=is_next_label)

        # transform sample to features
        cur_features = convert_example_to_features(cur_example, self.seq_len, self.tokenizer)

        cur_tensors = (torch.tensor(cur_features.input_ids),
                       torch.tensor(cur_features.input_mask),
                       torch.tensor(cur_features.segment_ids),
                       torch.tensor(cur_features.lm_label_ids),
                       torch.tensor(cur_features.is_next))

        return cur_tensors

    def random_sent(self, index):
        """
        Get one sample from corpus consisting of two sentences. With prob. 50% these are two subsequent sentences
        from one doc. With 50% the second sentence will be a random one from another doc.
        :param index: int, index of sample.
        :return: (str, str, int), sentence 1, sentence 2, isNextSentence Label
        """
        t1, t2 = self.get_corpus_line(index)
        if random.random() > 0.5:
            label = 0
        else:
            t2 = self.get_random_line()
            label = 1

        assert len(t1) > 0
        assert len(t2) > 0
        return t1, t2, label

    def get_corpus_line(self, item):
        """
        Get one sample from corpus consisting of a pair of two subsequent lines from the same doc.
        :param item: int, index of sample.
        :return: (str, str), two subsequent sentences from corpus
        """
        t1 = ""
        t2 = ""
        assert item < self.corpus_lines
        if self.on_memory:
            sample = self.sample_to_doc[item]
            t1 = self.all_docs[sample["doc_id"]][sample["line"]]
            t2 = self.all_docs[sample["doc_id"]][sample["line"]+1]
            # used later to avoid random nextSentence from same doc
            self.current_doc = sample["doc_id"]
            return t1, t2
        else:
            if self.line_buffer is None:
                # read first non-empty line of file
                while t1 == "" :
                    t1 = next(self.file).strip()
                    t2 = next(self.file).strip()
            else:
                # use t2 from previous iteration as new t1
                t1 = self.line_buffer
                t2 = next(self.file).strip()
                # skip empty rows that are used for separating documents and keep track of current doc id
                while t2 == "" or t1 == "":
                    t1 = next(self.file).strip()
                    t2 = next(self.file).strip()
                    self.current_doc = self.current_doc+1
            self.line_buffer = t2

        assert t1 != ""
        assert t2 != ""
        return t1, t2

    def get_random_line(self):
        """
        Get random line from another document for nextSentence task.
        :return: str, content of one line
        """
        # Similar to original tf repo: This outer loop should rarely go for more than one iteration for large
        # corpora. However, just to be careful, we try to make sure that
        # the random document is not the same as the document we're processing.
        for _ in range(10):
            if self.on_memory:
                rand_doc_idx = random.randint(0, len(self.all_docs)-1)
                rand_doc = self.all_docs[rand_doc_idx]
                line = rand_doc[random.randrange(len(rand_doc))]
            else:
                rand_index = random.randint(1, self.corpus_lines if self.corpus_lines < 1000 else 1000)
                #pick random line
                for _ in range(rand_index):
                    line = self.get_next_line()
            #check if our picked random line is really from another doc like we want it to be
            if self.current_random_doc != self.current_doc:
                break
        return line

    def get_next_line(self):
        """ Gets next line of random_file and starts over when reaching end of file"""
        try:
            line = next(self.random_file).strip()
            #keep track of which document we are currently looking at to later avoid having the same doc as t1
            if line == "":
                self.current_random_doc = self.current_random_doc + 1
                line = next(self.random_file).strip()
        except StopIteration:
            self.random_file.close()
            self.random_file = open(self.corpus_path, "r", encoding=self.encoding)
            line = next(self.random_file).strip()
        return line


class InputExample(object):
    """A single training/test example for the language model."""

    def __init__(self, guid, tokens_a, tokens_b=None, is_next=None, lm_labels=None):
        """Constructs a InputExample.

        Args:
            guid: Unique id for the example.
            tokens_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            tokens_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.tokens_a = tokens_a
        self.tokens_b = tokens_b
        self.is_next = is_next  # nextSentence
        self.lm_labels = lm_labels  # masked words for language model


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, is_next, lm_label_ids):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.is_next = is_next
        self.lm_label_ids = lm_label_ids


def random_word(tokens, tokenizer):
    """
    Masking some random tokens for Language Model task with probabilities as in the original BERT paper.
    :param tokens: list of str, tokenized sentence.
    :param tokenizer: Tokenizer, object used for tokenization (we need it's vocab here)
    :return: (list of str, list of int), masked tokens and related labels for LM prediction
    """
    output_label = []

    for i, token in enumerate(tokens):
        prob = random.random()
        # mask token with 15% probability
        if prob < 0.15:
            prob /= 0.15

            # 80% randomly change token to mask token
            if prob < 0.8:
                tokens[i] = "[MASK]"

            # 10% randomly change token to random token
            elif prob < 0.9:
                tokens[i] = random.choice(list(tokenizer.vocab.items()))[0]

            # -> rest 10% randomly keep current token

            # append current token to output (we will predict these later)
            try:
                output_label.append(tokenizer.vocab[token])
            except KeyError:
                # For unknown words (should not occur with BPE vocab)
                output_label.append(tokenizer.vocab["[UNK]"])
                logger.warning("Cannot find token '{}' in vocab. Using [UNK] insetad".format(token))
        else:
            # no masking token (will be ignored by loss function later)
            output_label.append(-1)

    return tokens, output_label


def convert_example_to_features(example, max_seq_length, tokenizer):
    """
    Convert a raw sample (pair of sentences as tokenized strings) into a proper training sample with
    IDs, LM labels, input_mask, CLS and SEP tokens etc.
    :param example: InputExample, containing sentence input as strings and is_next label
    :param max_seq_length: int, maximum length of sequence.
    :param tokenizer: Tokenizer
    :return: InputFeatures, containing all inputs and labels of one sample as IDs (as used for model training)
    """
    tokens_a = example.tokens_a
    tokens_b = example.tokens_b
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)

    tokens_a, t1_label = random_word(tokens_a, tokenizer)
    tokens_b, t2_label = random_word(tokens_b, tokenizer)
    # concatenate lm labels and account for CLS, SEP, SEP
    lm_label_ids = ([-1] + t1_label + [-1] + t2_label + [-1])

    # The convention in BERT is:
    # (a) For sequence pairs:
    #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #  type_ids: 0   0  0    0    0     0       0 0    1  1  1  1   1 1
    # (b) For single sequences:
    #  tokens:   [CLS] the dog is hairy . [SEP]
    #  type_ids: 0   0   0   0  0     0 0
    #
    # Where "type_ids" are used to indicate whether this is the first
    # sequence or the second sequence. The embedding vectors for `type=0` and
    # `type=1` were learned during pre-training and are added to the wordpiece
    # embedding vector (and position vector). This is not *strictly* necessary
    # since the [SEP] token unambigiously separates the sequences, but it makes
    # it easier for the model to learn the concept of sequences.
    #
    # For classification tasks, the first vector (corresponding to [CLS]) is
    # used as as the "sentence vector". Note that this only makes sense because
    # the entire model is fine-tuned.
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    assert len(tokens_b) > 0
    for token in tokens_b:
        tokens.append(token)
        segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
        lm_label_ids.append(-1)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    assert len(lm_label_ids) == max_seq_length

    if example.guid < 5:
        logger.info("*** Example ***")
        logger.info("guid: %s" % (example.guid))
        logger.info("tokens: %s" % " ".join(
                [str(x) for x in tokens]))
        logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
        logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
        logger.info(
                "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
        logger.info("LM label: %s " % (lm_label_ids))
        logger.info("Is next sentence label: %s " % (example.is_next))

    features = InputFeatures(input_ids=input_ids,
                             input_mask=input_mask,
                             segment_ids=segment_ids,
                             lm_label_ids=lm_label_ids,
                             is_next=example.is_next)
    return features


def main():
    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument("--train_corpus",
                        default=None,
                        type=str,
                        required=True,
                        help="The input train corpus.")
    parser.add_argument("--bert_model", default=None, type=str, required=True,
                        help="Bert pre-trained model selected in the list: bert-base-uncased, "
                             "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")
    parser.add_argument("--output_dir",
                        default=None,
                        type=str,
                        required=True,
                        help="The output directory where the model checkpoints will be written.")

    ## Other parameters
    parser.add_argument("--max_seq_length",
                        default=128,
                        type=int,
                        help="The maximum total input sequence length after WordPiece tokenization. \n"
                             "Sequences longer than this will be truncated, and sequences shorter \n"
                             "than this will be padded.")
    parser.add_argument("--do_train",
                        action='store_true',
                        help="Whether to run training.")
    parser.add_argument("--train_batch_size",
                        default=32,
                        type=int,
                        help="Total batch size for training.")
    parser.add_argument("--learning_rate",
                        default=3e-5,
                        type=float,
                        help="The initial learning rate for Adam.")
    parser.add_argument("--num_train_epochs",
                        default=3.0,
                        type=float,
                        help="Total number of training epochs to perform.")
    parser.add_argument("--warmup_proportion",
                        default=0.1,
                        type=float,
                        help="Proportion of training to perform linear learning rate warmup for. "
                             "E.g., 0.1 = 10%% of training.")
    parser.add_argument("--no_cuda",
                        action='store_true',
                        help="Whether not to use CUDA when available")
    parser.add_argument("--on_memory",
                        action='store_true',
                        help="Whether to load train samples into memory or use disk")
    parser.add_argument("--do_lower_case",
                        action='store_true',
                        help="Whether to lower case the input text. True for uncased models, False for cased models.")
    parser.add_argument("--local_rank",
                        type=int,
                        default=-1,
                        help="local_rank for distributed training on gpus")
    parser.add_argument('--seed',
                        type=int,
                        default=42,
                        help="random seed for initialization")
    parser.add_argument('--gradient_accumulation_steps',
                        type=int,
                        default=1,
                        help="Number of updates steps to accumualte before performing a backward/update pass.")
    parser.add_argument('--fp16',
                        action='store_true',
                        help="Whether to use 16-bit float precision instead of 32-bit")
    parser.add_argument('--loss_scale',
                        type = float, default = 0,
                        help = "Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
                        "0 (default value): dynamic loss scaling.\n"
                        "Positive power of 2: static loss scaling value.\n")

    args = parser.parse_args()

    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)
        n_gpu = 1
        # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
        torch.distributed.init_process_group(backend='nccl')
    logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
        device, n_gpu, bool(args.local_rank != -1), args.fp16))

    if args.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
                            args.gradient_accumulation_steps))

    args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

    if not args.do_train:
        raise ValueError("Training is currently the only implemented execution option. Please set `do_train`.")

    if os.path.exists(args.output_dir) and os.listdir(args.output_dir):
        raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

    #train_examples = None
    num_train_optimization_steps = None
    if args.do_train:
        print("Loading Train Dataset", args.train_corpus)
        train_dataset = BERTDataset(args.train_corpus, tokenizer, seq_len=args.max_seq_length,
                                    corpus_lines=None, on_memory=args.on_memory)
        num_train_optimization_steps = int(
            len(train_dataset) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs
        if args.local_rank != -1:
            num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

    # Prepare model
    model = BertForPreTraining.from_pretrained(args.bert_model)
    if args.fp16:
        model.half()
    model.to(device)
    if args.local_rank != -1:
        try:
            from apex.parallel import DistributedDataParallel as DDP
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
        model = DDP(model)
    elif n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Prepare optimizer
    if args.do_train:
        param_optimizer = list(model.named_parameters())
        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
            ]

        if args.fp16:
            try:
                from apex.optimizers import FP16_Optimizer
                from apex.optimizers import FusedAdam
            except ImportError:
                raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")

            optimizer = FusedAdam(optimizer_grouped_parameters,
                                  lr=args.learning_rate,
                                  bias_correction=False,
                                  max_grad_norm=1.0)
            if args.loss_scale == 0:
                optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
            else:
                optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
            warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion,
                                                 t_total=num_train_optimization_steps)

        else:
            optimizer = BertAdam(optimizer_grouped_parameters,
                                 lr=args.learning_rate,
                                 warmup=args.warmup_proportion,
                                 t_total=num_train_optimization_steps)

    global_step = 0
    if args.do_train:
        logger.info("***** Running training *****")
        logger.info("  Num examples = %d", len(train_dataset))
        logger.info("  Batch size = %d", args.train_batch_size)
        logger.info("  Num steps = %d", num_train_optimization_steps)

        if args.local_rank == -1:
            train_sampler = RandomSampler(train_dataset)
        else:
            #TODO: check if this works with current data generator from disk that relies on next(file)
            # (it doesn't return item back by index)
            train_sampler = DistributedSampler(train_dataset)
        train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)

        model.train()
        for _ in trange(int(args.num_train_epochs), desc="Epoch"):
            tr_loss = 0
            nb_tr_examples, nb_tr_steps = 0, 0
            for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
                batch = tuple(t.to(device) for t in batch)
                input_ids, input_mask, segment_ids, lm_label_ids, is_next = batch
                loss = model(input_ids, segment_ids, input_mask, lm_label_ids, is_next)
                if n_gpu > 1:
                    loss = loss.mean() # mean() to average on multi-gpu.
                if args.gradient_accumulation_steps > 1:
                    loss = loss / args.gradient_accumulation_steps
                if args.fp16:
                    optimizer.backward(loss)
                else:
                    loss.backward()
                tr_loss += loss.item()
                nb_tr_examples += input_ids.size(0)
                nb_tr_steps += 1
                if (step + 1) % args.gradient_accumulation_steps == 0:
                    if args.fp16:
                        # modify learning rate with special warm up BERT uses
                        # if args.fp16 is False, BertAdam is used that handles this automatically
                        lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
                        for param_group in optimizer.param_groups:
                            param_group['lr'] = lr_this_step
                    optimizer.step()
                    optimizer.zero_grad()
                    global_step += 1

        # Save a trained model
        logger.info("** ** * Saving fine - tuned model ** ** * ")
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self
        output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
        output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
        if args.do_train:
            torch.save(model_to_save.state_dict(), output_model_file)
            model_to_save.config.to_json_file(output_config_file)
            tokenizer.save_vocabulary(args.output_dir)


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def accuracy(out, labels):
    outputs = np.argmax(out, axis=1)
    return np.sum(outputs == labels)


if __name__ == "__main__":
    main()


================================================
FILE: examples/run_classifier.py
================================================
#coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""

from __future__ import absolute_import, division, print_function

import argparse
import csv
import logging
import os
import random

import sys
sys.path.append('..')

import copy

import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange

from torch.nn import CrossEntropyLoss, MSELoss
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import matthews_corrcoef, f1_score, classification_report


from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, WEIGHTS_NAME, CONFIG_NAME
from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig
from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule

logger = logging.getLogger(__name__)


class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=None):
        """Constructs a InputExample.

        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label
        self.entity_pos = entity_pos

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id, entity_mask=None, entity_seg_pos=None, entity_span1_pos=None, entity_span2_pos=None):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
        self.entity_mask = entity_mask
        self.entity_seg_pos = entity_seg_pos
        self.entity_span1_pos = entity_span1_pos
        self.entity_span2_pos = entity_span2_pos


class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                if sys.version_info[0] == 2:
                    line = list(unicode(cell, 'utf-8') for cell in line)
                lines.append(line)
            return lines


class MrpcProcessor(DataProcessor):
    """Processor for the MRPC data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            text_b = line[4]
            label = line[0]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

class SemProcessor(DataProcessor):
    """Processor for the SemEval 2010 Task 8 dataset."""

    def get_train_examples(self, data_dir):
        """See base class."""
        logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.jsonl")))
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.jsonl")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.jsonl")), "dev")

    def get_labels(self):
        """See base class."""
        return ['Message-Topic(e2,e1)', 'Instrument-Agency(e2,e1)', 'Entity-Origin(e2,e1)', 'Member-Collection(e1,e2)', 'Member-Collection(e2,e1)', 'Other', 'Component-Whole(e1,e2)', 'Product-Producer(e2,e1)', 'Component-Whole(e2,e1)', 'Entity-Destination(e2,e1)', 'Content-Container(e2,e1)', 'Entity-Destination(e1,e2)', 'Instrument-Agency(e1,e2)', 'Cause-Effect(e2,e1)', 'Entity-Origin(e1,e2)', 'Product-Producer(e1,e2)', 'Cause-Effect(e1,e2)', 'Message-Topic(e1,e2)', 'Content-Container(e1,e2)']

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        import json
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            line = json.loads(line[0])
            text_a = ' '.join(line['tokens'])
            label = line['label']
            entity_pos = line['entities']
            examples.append(
                InputExample(guid=guid, text_a=text_a, label=label, entity_pos = entity_pos))
        return examples


class MnliProcessor(DataProcessor):
    """Processor for the MultiNLI data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
            "dev_matched")

    def get_labels(self):
        """See base class."""
        return ["contradiction", "entailment", "neutral"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[8]
            text_b = line[9]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class MnliMismatchedProcessor(MnliProcessor):
    """Processor for the MultiNLI Mismatched data set (GLUE version)."""

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")),
            "dev_matched")


class ColaProcessor(DataProcessor):
    """Processor for the CoLA data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples


class Sst2Processor(DataProcessor):
    """Processor for the SST-2 data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[0]
            label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples


class StsbProcessor(DataProcessor):
    """Processor for the STS-B data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return [None]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[7]
            text_b = line[8]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class QqpProcessor(DataProcessor):
    """Processor for the QQP data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            try:
                text_a = line[3]
                text_b = line[4]
                label = line[5]
            except IndexError:
                continue
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class QnliProcessor(DataProcessor):
    """Processor for the QNLI data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), 
            "dev_matched")

    def get_labels(self):
        """See base class."""
        return ["entailment", "not_entailment"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class RteProcessor(DataProcessor):
    """Processor for the RTE data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["entailment", "not_entailment"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class WnliProcessor(DataProcessor):
    """Processor for the WNLI data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


def convert_examples_to_features(examples, label_list, max_seq_length,
                                 tokenizer, output_mode):
    """Loads a data file into a list of `InputBatch`s."""

    label_map = {label : i for i, label in enumerate(label_list)}
    features = []
    for (ex_index, example) in enumerate(examples):
        if ex_index % 10000 == 0:
            logger.info("Writing example %d of %d" % (ex_index, len(examples)))
        old_entity_pos = copy.deepcopy(example.entity_pos)
        tokens_a, new_entity_pos = tokenizer.tokenize(example.text_a,example.entity_pos)
        
        old_entity0 = ''.join(example.text_a.split()[old_entity_pos[0][0]:old_entity_pos[0][1]])
        old_entity1 = ''.join(example.text_a.split()[old_entity_pos[1][0]:old_entity_pos[1][1]])
        new_entity0 = ''.join(tokens_a[new_entity_pos[0][0]:new_entity_pos[0][1]])
        new_entity1 = ''.join(tokens_a[new_entity_pos[1][0]:new_entity_pos[1][1]])
        
        old_entity0 = old_entity0.lower()
        old_entity1 = old_entity1.lower()

        if '##' in new_entity0 or '##' in new_entity1:
            new_entity0 = new_entity0.replace('#','')
            new_entity1 = new_entity1.replace('#','')
        
        try:
            assert(old_entity0 == new_entity0)
            assert(old_entity1 == new_entity1)
        except:
            import pdb;pdb.set_trace()
        
        # Entity marker
        tokens_a_ = copy.deepcopy(tokens_a) 
        new_entity_pos_ = copy.deepcopy(new_entity_pos)
        entity1_start, entity1_end = new_entity_pos[0][0], new_entity_pos[0][1] 
        entity2_start, entity2_end = new_entity_pos[1][0], new_entity_pos[1][1] 
        
        tokens_a.insert(entity1_start, '<s1>') 
        new_entity_pos[0][0] = entity1_start
        tokens_a.insert(entity1_end+1, '<e1>')
        new_entity_pos[0][1] = entity1_end+1+1
        tokens_a.insert(entity2_start+2, '<s2>')
        new_entity_pos[1][0] = entity2_start+2
        tokens_a.insert(entity2_end+3,'<e2>')
        new_entity_pos[1][1] = entity2_end+3+1

        if new_entity_pos[1][1] > max_seq_length - 2 - 1:
            import pdb;pdb.set_trace()
            
        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0   0  0    0    0     0       0 0    1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0   0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
        segment_ids = [0] * len(tokens)

        if tokens_b:
            tokens += tokens_b + ["[SEP]"]
            segment_ids += [1] * (len(tokens_b) + 1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding = [0] * (max_seq_length - len(input_ids))
        input_ids += padding
        input_mask += padding
        segment_ids += padding
        

        # Used for mention pooling
        entity_mask_tag = 1
        entity_mask = [0] * len(input_ids)
        for entity in new_entity_pos:
            start, end = entity[0],entity[1]
            for i in range(start, end):
                # [CLS], need to +1 offset
                entity_mask[i+1] = entity_mask_tag
        
        """
            Different position embedding
        """
        # Strategy 1
        entity1_pos_tag = 1
        entity2_pos_tag = 2

        entity_seg_pos = [0] * len(input_ids)

        entity1_start, entity1_end = new_entity_pos[0][0], new_entity_pos[0][1] 
        for i in range(entity1_start, entity1_end):
            entity_seg_pos[i+1] = entity1_pos_tag
        entity2_start, entity2_end = new_entity_pos[1][0], new_entity_pos[1][1] 
        for i in range(entity2_start, entity2_end):
            entity_seg_pos[i+1] = entity2_pos_tag
        
        # Strategy 2
        entity_start_pos_tag = 1
        entity_seg_pos_ = [0] * len(input_ids)
        entity1_start, entity1_end = new_entity_pos[0][0], new_entity_pos[0][1] 
        entity_seg_pos_[entity1_start+1] = entity_start_pos_tag
        entity2_start, entity2_end = new_entity_pos[1][0], new_entity_pos[1][1] 
        entity_seg_pos_[entity2_start+1] = entity_start_pos_tag

        # Strategy 3
        entity_span1_pos = [0] * len(input_ids)
        entity1_start, entity1_end = new_entity_pos[0][0], new_entity_pos[0][1] 
        for i in range(len(entity_span1_pos)):
            if i < entity1_start:
                #entity_span1_pos[i] = np.abs(i - entity1_start)
                entity_span1_pos[i] = i - entity1_start
            elif entity1_start <= i and i < entity1_end:
                entity_span1_pos[i] = 0
            elif i >= entity1_end:
                entity_span1_pos[i] = i - entity1_end + 1
        
        entity_span2_pos = [0] * len(input_ids)
        entity2_start, entity2_end = new_entity_pos[1][0], new_entity_pos[1][1] 
        for i in range(len(entity_span2_pos)):
            if i < entity2_start:
                #entity_span2_pos[i] = np.abs(i - entity2_start)
                entity_span2_pos[i] = i - entity2_start
            elif entity2_start <= i and i < entity2_end:
                entity_span2_pos[i] = 0
            elif i >= entity2_end:
                entity_span2_pos[i] = i - entity2_end + 1

        # Avoid to get negative position to fuck the nn.Embedding
        #entity_span1_pos = [pos+max_seq_length-1 for pos in entity_span1_pos]
        #entity_span2_pos = [pos+max_seq_length-1 for pos in entity_span2_pos]
        
        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length
        assert len(entity_mask) == max_seq_length
        assert len(entity_seg_pos) == max_seq_length
        assert len(entity_seg_pos_) == max_seq_length
        assert len(entity_span1_pos) == max_seq_length
        assert len(entity_span2_pos) == max_seq_length
        if output_mode == "classification":
            label_id = label_map[example.label]
        elif output_mode == "regression":
            label_id = float(example.label)
        else:
            raise KeyError(output_mode)

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s" % (example.guid))
            logger.info("tokens: %s" % " ".join(
                    [str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            logger.info("entity_mask: %s" % " ".join([str(x) for x in entity_mask]))
            logger.info("entity_seg_pos: %s" % " ".join([str(x) for x in entity_seg_pos]))
            logger.info("entity_seg_pos_: %s" % " ".join([str(x) for x in entity_seg_pos_]))
            logger.info("entity_span1_pos: %s" % " ".join([str(x) for x in entity_span1_pos]))
            logger.info("entity_span2_pos: %s" % " ".join([str(x) for x in entity_span2_pos]))
            logger.info(
                    "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))
        
        #if example.guid == 'train-3':
        #    import pdb;pdb.set_trace()

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_id=label_id,
                              entity_mask=entity_mask,
                              entity_seg_pos=entity_seg_pos_,
                              entity_span1_pos=entity_span1_pos,
                              entity_span2_pos=entity_span2_pos))
    return features


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def simple_accuracy(preds, labels):
    return (preds == labels).mean()


def acc_and_f1(preds, labels):
    acc = simple_accuracy(preds, labels)
    f1 = f1_score(y_true=labels, y_pred=preds,average='micro')
    report = classification_report(labels, preds)
    return {
        "acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2,
        "report": report
    }


def pearson_and_spearman(preds, labels):
    pearson_corr = pearsonr(preds, labels)[0]
    spearman_corr = spearmanr(preds, labels)[0]
    return {
        "pearson": pearson_corr,
        "spearmanr": spearman_corr,
        "corr": (pearson_corr + spearman_corr) / 2,
    }


def compute_metrics(task_name, preds, labels):
    assert len(preds) == len(labels)
    if task_name == "cola":
        return {"mcc": matthews_corrcoef(labels, preds)}
    elif task_name == "sst-2":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "mrpc":
        return acc_and_f1(preds, labels)
    elif task_name == "sem":
        return acc_and_f1(preds, labels)
    elif task_name == "sts-b":
        return pearson_and_spearman(preds, labels)
    elif task_name == "qqp":
        return acc_and_f1(preds, labels)
    elif task_name == "mnli":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "mnli-mm":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "qnli":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "rte":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "wnli":
        return {"acc": simple_accuracy(preds, labels)}
    else:
        raise KeyError(task_name)


def main():
    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument("--data_dir",
                        default=None,
                        type=str,
                        required=True,
                        help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
    parser.add_argument("--bert_model", default=None, type=str, required=True,
                        help="Bert pre-trained model selected in the list: bert-base-uncased, "
                        "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, "
                        "bert-base-multilingual-cased, bert-base-chinese.")
    parser.add_argument("--task_name",
                        default=None,
                        type=str,
                        required=True,
                        help="The name of the task to train.")
    parser.add_argument("--output_dir",
                        default=None,
                        type=str,
                        required=True,
                        help="The output directory where the model predictions and checkpoints will be written.")

    ## Other parameters
    parser.add_argument("--cache_dir",
                        default="",
                        type=str,
                        help="Where do you want to store the pre-trained models downloaded from s3")
    parser.add_argument("--max_seq_length",
                        default=128,
                        type=int,
                        help="The maximum total input sequence length after WordPiece tokenization. \n"
                             "Sequences longer than this will be truncated, and sequences shorter \n"
                             "than this will be padded.")
    parser.add_argument("--do_train",
                        action='store_true',
                        help="Whether to run training.")
    parser.add_argument("--do_eval",
                        action='store_true',
                        help="Whether to run eval on the dev set.")
    parser.add_argument("--do_lower_case",
                        action='store_true',
                        help="Set this flag if you are using an uncased model.")
    parser.add_argument("--train_batch_size",
                        default=32,
                        type=int,
                        help="Total batch size for training.")
    parser.add_argument("--eval_batch_size",
                        default=8,
                        type=int,
                        help="Total batch size for eval.")
    parser.add_argument("--learning_rate",
                        default=5e-5,
                        type=float,
                        help="The initial learning rate for Adam.")
    parser.add_argument("--num_train_epochs",
                        default=3.0,
                        type=float,
                        help="Total number of training epochs to perform.")
    parser.add_argument("--warmup_proportion",
                        default=0.1,
                        type=float,
                        help="Proportion of training to perform linear learning rate warmup for. "
                             "E.g., 0.1 = 10%% of training.")
    parser.add_argument("--no_cuda",
                        action='store_true',
                        help="Whether not to use CUDA when available")
    parser.add_argument("--local_rank",
                        type=int,
                        default=-1,
                        help="local_rank for distributed training on gpus")
    parser.add_argument('--seed',
                        type=int,
                        default=42,
                        help="random seed for initialization")
    parser.add_argument('--gradient_accumulation_steps',
                        type=int,
                        default=1,
                        help="Number of updates steps to accumulate before performing a backward/update pass.")
    parser.add_argument('--fp16',
                        action='store_true',
                        help="Whether to use 16-bit float precision instead of 32-bit")
    parser.add_argument('--loss_scale',
                        type=float, default=0,
                        help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
                             "0 (default value): dynamic loss scaling.\n"
                             "Positive power of 2: static loss scaling value.\n")
    parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
    parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
    args = parser.parse_args()

    if args.server_ip and args.server_port:
        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
        import ptvsd
        print("Waiting for debugger attach")
        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
        ptvsd.wait_for_attach()

    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mnli-mm": MnliMismatchedProcessor,
        "mrpc": MrpcProcessor,
        "sem": SemProcessor,
        "sst-2": Sst2Processor,
        "sts-b": StsbProcessor,
        "qqp": QqpProcessor,
        "qnli": QnliProcessor,
        "rte": RteProcessor,
        "wnli": WnliProcessor,
    }

    output_modes = {
        "cola": "classification",
        "mnli": "classification",
        "mrpc": "classification",
        "sem": "classification",
        "sst-2": "classification",
        "sts-b": "regression",
        "qqp": "classification",
        "qnli": "classification",
        "rte": "classification",
        "wnli": "classification",
    }

    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)
        n_gpu = 1
        # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
        torch.distributed.init_process_group(backend='nccl')

    logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                        datefmt = '%m/%d/%Y %H:%M:%S',
                        level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)

    logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
        device, n_gpu, bool(args.local_rank != -1), args.fp16))

    if args.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
                            args.gradient_accumulation_steps))

    args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

    if not args.do_train and not args.do_eval:
        raise ValueError("At least one of `do_train` or `do_eval` must be True.")

    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train:
        raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    task_name = args.task_name.lower()

    if task_name not in processors:
        raise ValueError("Task not found: %s" % (task_name))

    processor = processors[task_name]()
    output_mode = output_modes[task_name]

    label_list = processor.get_labels()
    num_labels = len(label_list)
    tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
    train_examples = None
    num_train_optimization_steps = None
    if args.do_train:
        train_examples = processor.get_train_examples(args.data_dir)
        num_train_optimization_steps = int(
            len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs
        if args.local_rank != -1:
            num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

    # Prepare model
    cache_dir = args.cache_dir if args.cache_dir else os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank))
    model = BertForSequenceClassification.from_pretrained(args.bert_model,
              cache_dir=cache_dir,
              num_labels=num_labels)
    if args.fp16:
        model.half()
    model.to(device)
    if args.local_rank != -1:
        try:
            from apex.parallel import DistributedDataParallel as DDP
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")

        model = DDP(model)
    elif n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Prepare optimizer
    if args.do_train:
        param_optimizer = list(model.named_parameters())
        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
            ]
        if args.fp16:
            try:
                from apex.optimizers import FP16_Optimizer
                from apex.optimizers import FusedAdam
            except ImportError:
                raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")

            optimizer = FusedAdam(optimizer_grouped_parameters,
                                  lr=args.learning_rate,
                                  bias_correction=False,
                                  max_grad_norm=1.0)
            if args.loss_scale == 0:
                optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
            else:
                optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
            warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion,
                                                 t_total=num_train_optimization_steps)

        else:
            optimizer = BertAdam(optimizer_grouped_parameters,
                                 lr=args.learning_rate,
                                 warmup=args.warmup_proportion,
                                 t_total=num_train_optimization_steps)

    global_step = 0
    nb_tr_steps = 0
    tr_loss = 0
    if args.do_train:
        train_features = convert_examples_to_features(
            train_examples, label_list, args.max_seq_length, tokenizer, output_mode)
        logger.info("***** Running training *****")
        logger.info("  Num examples = %d", len(train_examples))
        logger.info("  Batch size = %d", args.train_batch_size)
        logger.info("  Num steps = %d", num_train_optimization_steps)
        all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
        # FloatTensor(forward)
        all_entity_mask = torch.tensor([f.entity_mask for f in train_features], dtype=torch.float)
        all_entity_seg_pos = torch.tensor([f.entity_seg_pos for f in train_features], dtype=torch.long)
        all_entity_span1_pos = torch.tensor([f.entity_span1_pos for f in train_features], dtype=torch.float)
        all_entity_span2_pos = torch.tensor([f.entity_span2_pos for f in train_features], dtype=torch.float)
        all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
        if output_mode == "classification":
            all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
        elif output_mode == "regression":
            all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.float)

        train_data = TensorDataset(all_input_ids, all_input_mask, all_entity_mask, all_entity_seg_pos, all_entity_span1_pos, all_entity_span2_pos, all_segment_ids, all_label_ids)
        if args.local_rank == -1:
            train_sampler = RandomSampler(train_data)
        else:
            train_sampler = DistributedSampler(train_data)
        train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size)

        model.train()
        for _ in trange(int(args.num_train_epochs), desc="Epoch"):
            tr_loss = 0
            nb_tr_examples, nb_tr_steps = 0, 0
            for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
                batch = tuple(t.to(device) for t in batch)
                input_ids, input_mask, entity_mask, entity_seg_pos, entity_span1_pos, entity_span2_pos, segment_ids, label_ids = batch
                # define a new function to compute loss values for both output_modes
                logits = model(input_ids, segment_ids, input_mask, entity_mask, entity_seg_pos, entity_span1_pos, entity_span2_pos, labels=None)

                if output_mode == "classification":
                    loss_fct = CrossEntropyLoss()
                    loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
                elif output_mode == "regression":
                    loss_fct = MSELoss()
                    loss = loss_fct(logits.view(-1), label_ids.view(-1))

                if n_gpu > 1:
                    loss = loss.mean() # mean() to average on multi-gpu.
                if args.gradient_accumulation_steps > 1:
                    loss = loss / args.gradient_accumulation_steps

                if args.fp16:
                    optimizer.backward(loss)
                else:
                    loss.backward()

                tr_loss += loss.item()
                nb_tr_examples += input_ids.size(0)
                nb_tr_steps += 1
                if (step + 1) % args.gradient_accumulation_steps == 0:
                    if args.fp16:
                        # modify learning rate with special warm up BERT uses
                        # if args.fp16 is False, BertAdam is used that handles this automatically
                        lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
                        for param_group in optimizer.param_groups:
                            param_group['lr'] = lr_this_step
                    optimizer.step()
                    optimizer.zero_grad()
                    global_step += 1

    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
        # Save a trained model, configuration and tokenizer
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self

        # If we save using the predefined names, we can load using `from_pretrained`
        output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
        output_config_file = os.path.join(args.output_dir, CONFIG_NAME)

        torch.save(model_to_save.state_dict(), output_model_file)
        model_to_save.config.to_json_file(output_config_file)
        tokenizer.save_vocabulary(args.output_dir)

        # Load a trained model and vocabulary that you have fine-tuned
        model = BertForSequenceClassification.from_pretrained(args.output_dir, num_labels=num_labels)
        tokenizer = BertTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
    else:
        model = BertForSequenceClassification.from_pretrained(args.bert_model, num_labels=num_labels)
    model.to(device)

    if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
        eval_examples = processor.get_dev_examples(args.data_dir)
        eval_features = convert_examples_to_features(
            eval_examples, label_list, args.max_seq_length, tokenizer, output_mode)
        logger.info("***** Running evaluation *****")
        logger.info("  Num examples = %d", len(eval_examples))
        logger.info("  Batch size = %d", args.eval_batch_size)
        all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
        all_entity_mask = torch.tensor([f.entity_mask for f in eval_features], dtype=torch.float)
        all_entity_seg_pos = torch.tensor([f.entity_seg_pos for f in eval_features], dtype=torch.long)
        all_entity_span1_pos = torch.tensor([f.entity_span1_pos for f in eval_features], dtype=torch.float)
        all_entity_span2_pos = torch.tensor([f.entity_span2_pos for f in eval_features], dtype=torch.float)
        all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)

        if output_mode == "classification":
            all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
        elif output_mode == "regression":
            all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.float)

        eval_data = TensorDataset(all_input_ids, all_input_mask, all_entity_mask, all_entity_seg_pos, all_entity_span1_pos, all_entity_span2_pos, all_segment_ids, all_label_ids)
        # Run prediction for full data
        eval_sampler = SequentialSampler(eval_data)
        eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)

        model.eval()
        eval_loss = 0
        nb_eval_steps = 0
        preds = []

        for input_ids, input_mask, entity_mask, entity_seg_pos, entity_span1_pos, entity_span2_pos, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"):
            input_ids = input_ids.to(device)
            input_mask = input_mask.to(device)
            entity_mask = entity_mask.to(device)
            entity_seg_pos = entity_seg_pos.to(device)
            entity_span1_pos = entity_span1_pos.to(device)
            entity_span2_pos = entity_span2_pos.to(device)
            segment_ids = segment_ids.to(device)
            label_ids = label_ids.to(device)
            with torch.no_grad():
                logits = model(input_ids, segment_ids, input_mask, entity_mask, entity_seg_pos, entity_span1_pos, entity_span2_pos, labels=None)
                #logits = model(input_ids, segment_ids, input_mask, labels=None)

            # create eval loss and other metric required by the task
            if output_mode == "classification":
                loss_fct = CrossEntropyLoss()
                tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
            elif output_mode == "regression":
                loss_fct = MSELoss()
                tmp_eval_loss = loss_fct(logits.view(-1), label_ids.view(-1))
            
            eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if len(preds) == 0:
                preds.append(logits.detach().cpu().numpy())
            else:
                preds[0] = np.append(
                    preds[0], logits.detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        preds = preds[0]
        if output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(task_name, preds, all_label_ids.numpy())
        loss = tr_loss/global_step if args.do_train else None

        result['eval_loss'] = eval_loss
        result['global_step'] = global_step
        result['loss'] = loss

        output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results *****")
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

        # hack for MNLI-MM
        if task_name == "mnli":
            task_name = "mnli-mm"
            processor = processors[task_name]()

            if os.path.exists(args.output_dir + '-MM') and os.listdir(args.output_dir + '-MM') and args.do_train:
                raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
            if not os.path.exists(args.output_dir + '-MM'):
                os.makedirs(args.output_dir + '-MM')

            eval_examples = processor.get_dev_examples(args.data_dir)
            eval_features = convert_examples_to_features(
                eval_examples, label_list, args.max_seq_length, tokenizer, output_mode)
            logger.info("***** Running evaluation *****")
            logger.info("  Num examples = %d", len(eval_examples))
            logger.info("  Batch size = %d", args.eval_batch_size)
            all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
            all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
            all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
            all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)

            eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
            # Run prediction for full data
            eval_sampler = SequentialSampler(eval_data)
            eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)

            model.eval()
            eval_loss = 0
            nb_eval_steps = 0
            preds = []

            for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"):
                input_ids = input_ids.to(device)
                input_mask = input_mask.to(device)
                segment_ids = segment_ids.to(device)
                label_ids = label_ids.to(device)

                with torch.no_grad():
                    logits = model(input_ids, segment_ids, input_mask, labels=None)
            
                loss_fct = CrossEntropyLoss()
                tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
            
                eval_loss += tmp_eval_loss.mean().item()
                nb_eval_steps += 1
                if len(preds) == 0:
                    preds.append(logits.detach().cpu().numpy())
                else:
                    preds[0] = np.append(
                        preds[0], logits.detach().cpu().numpy(), axis=0)

            eval_loss = eval_loss / nb_eval_steps
            preds = preds[0]
            preds = np.argmax(preds, axis=1)
            result = compute_metrics(task_name, preds, all_label_ids.numpy())
            loss = tr_loss/global_step if args.do_train else None

            result['eval_loss'] = eval_loss
            result['global_step'] = global_step
            result['loss'] = loss

            output_eval_file = os.path.join(args.output_dir + '-MM', "eval_results.txt")
            with open(output_eval_file, "w") as writer:
                logger.info("***** Eval results *****")
                for key in sorted(result.keys()):
                    logger.info("  %s = %s", key, str(result[key]))
                    writer.write("%s = %s\n" % (key, str(result[key])))

if __name__ == "__main__":
    main()


================================================
FILE: examples/run_classifier_dataset_utils.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BERT classification fine-tuning: utilities to work with GLUE tasks """

from __future__ import absolute_import, division, print_function

import csv
import logging
import os
import sys

from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import matthews_corrcoef, f1_score

logger = logging.getLogger(__name__)


class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.

        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id


class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                if sys.version_info[0] == 2:
                    line = list(unicode(cell, 'utf-8') for cell in line)
                lines.append(line)
            return lines


class MrpcProcessor(DataProcessor):
    """Processor for the MRPC data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            text_b = line[4]
            label = line[0]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class MnliProcessor(DataProcessor):
    """Processor for the MultiNLI data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
            "dev_matched")

    def get_labels(self):
        """See base class."""
        return ["contradiction", "entailment", "neutral"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[8]
            text_b = line[9]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class MnliMismatchedProcessor(MnliProcessor):
    """Processor for the MultiNLI Mismatched data set (GLUE version)."""

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")),
            "dev_matched")


class ColaProcessor(DataProcessor):
    """Processor for the CoLA data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples


class Sst2Processor(DataProcessor):
    """Processor for the SST-2 data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[0]
            label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples


class StsbProcessor(DataProcessor):
    """Processor for the STS-B data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return [None]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[7]
            text_b = line[8]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class QqpProcessor(DataProcessor):
    """Processor for the QQP data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            try:
                text_a = line[3]
                text_b = line[4]
                label = line[5]
            except IndexError:
                continue
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class QnliProcessor(DataProcessor):
    """Processor for the QNLI data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), 
            "dev_matched")

    def get_labels(self):
        """See base class."""
        return ["entailment", "not_entailment"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class RteProcessor(DataProcessor):
    """Processor for the RTE data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["entailment", "not_entailment"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


class WnliProcessor(DataProcessor):
    """Processor for the WNLI data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


def convert_examples_to_features(examples, label_list, max_seq_length,
                                 tokenizer, output_mode):
    """Loads a data file into a list of `InputBatch`s."""

    label_map = {label : i for i, label in enumerate(label_list)}

    features = []
    for (ex_index, example) in enumerate(examples):
        if ex_index % 10000 == 0:
            logger.info("Writing example %d of %d" % (ex_index, len(examples)))

        tokens_a = tokenizer.tokenize(example.text_a)

        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0   0  0    0    0     0       0 0    1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0   0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
        segment_ids = [0] * len(tokens)

        if tokens_b:
            tokens += tokens_b + ["[SEP]"]
            segment_ids += [1] * (len(tokens_b) + 1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding = [0] * (max_seq_length - len(input_ids))
        input_ids += padding
        input_mask += padding
        segment_ids += padding

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        if output_mode == "classification":
            label_id = label_map[example.label]
        elif output_mode == "regression":
            label_id = float(example.label)
        else:
            raise KeyError(output_mode)

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s" % (example.guid))
            logger.info("tokens: %s" % " ".join(
                    [str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            logger.info(
                    "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_id=label_id))
    return features


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def simple_accuracy(preds, labels):
    return (preds == labels).mean()


def acc_and_f1(preds, labels):
    acc = simple_accuracy(preds, labels)
    f1 = f1_score(y_true=labels, y_pred=preds)
    return {
        "acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2,
    }


def pearson_and_spearman(preds, labels):
    pearson_corr = pearsonr(preds, labels)[0]
    spearman_corr = spearmanr(preds, labels)[0]
    return {
        "pearson": pearson_corr,
        "spearmanr": spearman_corr,
        "corr": (pearson_corr + spearman_corr) / 2,
    }


def compute_metrics(task_name, preds, labels):
    assert len(preds) == len(labels)
    if task_name == "cola":
        return {"mcc": matthews_corrcoef(labels, preds)}
    elif task_name == "sst-2":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "mrpc":
        return acc_and_f1(preds, labels)
    elif task_name == "sts-b":
        return pearson_and_spearman(preds, labels)
    elif task_name == "qqp":
        return acc_and_f1(preds, labels)
    elif task_name == "mnli":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "mnli-mm":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "qnli":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "rte":
        return {"acc": simple_accuracy(preds, labels)}
    elif task_name == "wnli":
        return {"acc": simple_accuracy(preds, labels)}
    else:
        raise KeyError(task_name)

processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mnli-mm": MnliMismatchedProcessor,
    "mrpc": MrpcProcessor,
    "sst-2": Sst2Processor,
    "sts-b": StsbProcessor,
    "qqp": QqpProcessor,
    "qnli": QnliProcessor,
    "rte": RteProcessor,
    "wnli": WnliProcessor,
}

output_modes = {
    "cola": "classification",
    "mnli": "classification",
    "mrpc": "classification",
    "sst-2": "classification",
    "sts-b": "regression",
    "qqp": "classification",
    "qnli": "classification",
    "rte": "classification",
    "wnli": "classification",
}


================================================
FILE: examples/run_gpt2.py
================================================
#!/usr/bin/env python3

import argparse
import logging
from tqdm import trange

import torch
import torch.nn.functional as F
import numpy as np

from pytorch_pretrained_bert import GPT2LMHeadModel, GPT2Tokenizer

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

def top_k_logits(logits, k):
    """
    Masks everything but the k top entries as -infinity (1e10).
    Used to mask logits such that e^-infinity -> 0 won't contribute to the
    sum of the denominator.
    """
    if k == 0:
        return logits
    else:
        values = torch.topk(logits, k)[0]
        batch_mins = values[:, -1].view(-1, 1).expand_as(logits)
        return torch.where(logits < batch_mins, torch.ones_like(logits) * -1e10, logits)

def sample_sequence(model, length, start_token=None, batch_size=None, context=None, temperature=1, top_k=0, device='cuda', sample=True):
    if start_token is None:
        assert context is not None, 'Specify exactly one of start_token and context!'
        context = torch.tensor(context, device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1)
    else:
        assert context is None, 'Specify exactly one of start_token and context!'
        context = torch.full((batch_size, 1), start_token, device=device, dtype=torch.long)
    prev = context
    output = context
    past = None
    with torch.no_grad():
        for i in trange(length):
            logits, past = model(prev, past=past)
            logits = logits[:, -1, :] / temperature
            logits = top_k_logits(logits, k=top_k)
            log_probs = F.softmax(logits, dim=-1)
            if sample:
                prev = torch.multinomial(log_probs, num_samples=1)
            else:
                _, prev = torch.topk(log_probs, k=1, dim=-1)
            output = torch.cat((output, prev), dim=1)
    return output

def run_model():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default='gpt2', help='pretrained model name or path to local checkpoint')
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--nsamples", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=-1)
    parser.add_argument("--length", type=int, default=-1)
    parser.add_argument("--temperature", type=float, default=1.0)
    parser.add_argument("--top_k", type=int, default=0)
    parser.add_argument('--unconditional', action='store_true', help='If true, unconditional generation.')
    args = parser.parse_args()
    print(args)

    if args.batch_size == -1:
        args.batch_size = 1
    assert args.nsamples % args.batch_size == 0

    np.random.seed(args.seed)
    torch.random.manual_seed(args.seed)
    torch.cuda.manual_seed(args.seed)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    enc = GPT2Tokenizer.from_pretrained(args.model_name_or_path)
    model = GPT2LMHeadModel.from_pretrained(args.model_name_or_path)
    model.to(device)
    model.eval()

    if args.length == -1:
        args.length = model.config.n_ctx // 2
    elif args.length > model.config.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % model.config.n_ctx)

    while True:
        context_tokens = []
        if not args.unconditional:
            raw_text = input("Model prompt >>> ")
            while not raw_text:
                print('Prompt should not be empty!')
                raw_text = input("Model prompt >>> ")
            context_tokens = enc.encode(raw_text)
            generated = 0
            for _ in range(args.nsamples // args.batch_size):
                out = sample_sequence(
                    model=model, length=args.length,
                    context=context_tokens,
                    start_token=None,
                    batch_size=args.batch_size,
                    temperature=args.temperature, top_k=args.top_k, device=device
                )
                out = out[:, len(context_tokens):].tolist()
                for i in range(args.batch_size):
                    generated += 1
                    text = enc.decode(out[i])
                    print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
                    print(text)
            print("=" * 80)
        else:
            generated = 0
            for _ in range(args.nsamples // args.batch_size):
                out = sample_sequence(
                    model=model, length=args.length,
                    context=None,
                    start_token=enc.encoder['<|endoftext|>'],
                    batch_size=args.batch_size,
                    temperature=args.temperature, top_k=args.top_k, device=device
                )
                out = out[:,1:].tolist()
                for i in range(args.batch_size):
                    generated += 1
                    text = enc.decode(out[i])
                    print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
                    print(text)
            print("=" * 80)

if __name__ == '__main__':
    run_model()




================================================
FILE: examples/run_openai_gpt.py
================================================
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" OpenAI GPT model fine-tuning script.
    Adapted from https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/train.py
    It self adapted from https://github.com/openai/finetune-transformer-lm/blob/master/train.py

    This script with default values fine-tunes and evaluate a pretrained OpenAI GPT on the RocStories dataset:
        python run_openai_gpt.py \
          --model_name openai-gpt \
          --do_train \
          --do_eval \
          --train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
          --eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
          --output_dir ../log \
          --train_batch_size 16 \
"""
import argparse
import os
import csv
import random
import logging
from tqdm import tqdm, trange

import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)

from pytorch_pretrained_bert import (OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer,
                                     OpenAIAdam, cached_path, WEIGHTS_NAME, CONFIG_NAME)

ROCSTORIES_URL = "https://s3.amazonaws.com/datasets.huggingface.co/ROCStories.tar.gz"

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

def accuracy(out, labels):
    outputs = np.argmax(out, axis=1)
    return np.sum(outputs == labels)

def load_rocstories_dataset(dataset_path):
    """ Output a list of tuples(story, 1st continuation, 2nd continuation, label) """
    with open(dataset_path, encoding='utf_8') as f:
        f = csv.reader(f)
        output = []
        next(f) # skip the first line
        for line in tqdm(f):
            output.append((' '.join(line[1:5]), line[5], line[6], int(line[-1])-1))
    return output

def pre_process_datasets(encoded_datasets, input_len, cap_length, start_token, delimiter_token, clf_token):
    """ Pre-process datasets containing lists of tuples(story, 1st continuation, 2nd continuation, label)

        To Transformer inputs of shape (n_batch, n_alternative, length) comprising for each batch, continuation:
        input_ids[batch, alternative, :] = [start_token] + story[:cap_length] + [delimiter_token] + cont1[:cap_length] + [clf_token]
    """
    tensor_datasets = []
    for dataset in encoded_datasets:
        n_batch = len(dataset)
        input_ids = np.zeros((n_batch, 2, input_len), dtype=np.int64)
        mc_token_ids = np.zeros((n_batch, 2), dtype=np.int64)
        lm_labels = np.full((n_batch, 2, input_len), fill_value=-1, dtype=np.int64)
        mc_labels = np.zeros((n_batch,), dtype=np.int64)
        for i, (story, cont1, cont2, mc_label), in enumerate(dataset):
            with_cont1 = [start_token] + story[:cap_length] + [delimiter_token] + cont1[:cap_length] + [clf_token]
            with_cont2 = [start_token] + story[:cap_length] + [delimiter_token] + cont2[:cap_length] + [clf_token]
            input_ids[i, 0, :len(with_cont1)] = with_cont1
            input_ids[i, 1, :len(with_cont2)] = with_cont2
            mc_token_ids[i, 0] = len(with_cont1) - 1
            mc_token_ids[i, 1] = len(with_cont2) - 1
            lm_labels[i, 0, :len(with_cont1)] = with_cont1
            lm_labels[i, 1, :len(with_cont2)] = with_cont2
            mc_labels[i] = mc_label
        all_inputs = (input_ids, mc_token_ids, lm_labels, mc_labels)
        tensor_datasets.append(tuple(torch.tensor(t) for t in all_inputs))
    return tensor_datasets

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name', type=str, default='openai-gpt',
                        help='pretrained model name')
    parser.add_argument("--do_train", action='store_true', help="Whether to run training.")
    parser.add_argument("--do_eval", action='store_true', help="Whether to run eval on the dev set.")
    parser.add_argument("--output_dir", default=None, type=str, required=True,
                        help="The output directory where the model predictions and checkpoints will be written.")
    parser.add_argument('--train_dataset', type=str, default='')
    parser.add_argument('--eval_dataset', type=str, default='')
    parser.add_argument('--seed', type=int, default=42)
    parser.add_argument('--num_train_epochs', type=int, default=3)
    parser.add_argument('--train_batch_size', type=int, default=8)
    parser.add_argument('--eval_batch_size', type=int, default=16)
    parser.add_argument('--max_grad_norm', type=int, default=1)
    parser.add_argument('--learning_rate', type=float, default=6.25e-5)
    parser.add_argument('--warmup_proportion', type=float, default=0.002)
    parser.add_argument('--lr_schedule', type=str, default='warmup_linear')
    parser.add_argument('--weight_decay', type=float, default=0.01)
    parser.add_argument('--lm_coef', type=float, default=0.9)
    parser.add_argument('--n_valid', type=int, default=374)

    parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
    parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
    args = parser.parse_args()
    print(args)

    if args.server_ip and args.server_port:
        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
        import ptvsd
        print("Waiting for debugger attach")
        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
        ptvsd.wait_for_attach()

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed_all(args.seed)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    n_gpu = torch.cuda.device_count()
    logger.info("device: {}, n_gpu {}".format(device, n_gpu))

    if not args.do_train and not args.do_eval:
        raise ValueError("At least one of `do_train` or `do_eval` must be True.")

    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    # Load tokenizer and model
    # This loading functions also add new tokens and embeddings called `special tokens`
    # These new embeddings will be fine-tuned on the RocStories dataset
    special_tokens = ['_start_', '_delimiter_', '_classify_']
    tokenizer = OpenAIGPTTokenizer.from_pretrained(args.model_name, special_tokens=special_tokens)
    special_tokens_ids = list(tokenizer.convert_tokens_to_ids(token) for token in special_tokens)
    model = OpenAIGPTDoubleHeadsModel.from_pretrained(args.model_name, num_special_tokens=len(special_tokens))
    model.to(device)

    # Load and encode the datasets
    if not args.train_dataset and not args.eval_dataset:
        roc_stories = cached_path(ROCSTORIES_URL)
    def tokenize_and_encode(obj):
        """ Tokenize and encode a nested object """
        if isinstance(obj, str):
            return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
        elif isinstance(obj, int):
            return obj
        return list(tokenize_and_encode(o) for o in obj)
    logger.info("Encoding dataset...")
    train_dataset = load_rocstories_dataset(args.train_dataset)
    eval_dataset = load_rocstories_dataset(args.eval_dataset)
    datasets = (train_dataset, eval_dataset)
    encoded_datasets = tokenize_and_encode(datasets)

    # Compute the max input length for the Transformer
    max_length = model.config.n_positions // 2 - 2
    input_length = max(len(story[:max_length]) + max(len(cont1[:max_length]), len(cont2[:max_length])) + 3  \
                           for dataset in encoded_datasets for story, cont1, cont2, _ in dataset)
    input_length = min(input_length, model.config.n_positions)  # Max size of input for the pre-trained model

    # Prepare inputs tensors and dataloaders
    tensor_datasets = pre_process_datasets(encoded_datasets, input_length, max_length, *special_tokens_ids)
    train_tensor_dataset, eval_tensor_dataset = tensor_datasets[0], tensor_datasets[1]

    train_data = TensorDataset(*train_tensor_dataset)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size)

    eval_data = TensorDataset(*eval_tensor_dataset)
    eval_sampler = SequentialSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)

    # Prepare optimizer
    if args.do_train:
        param_optimizer = list(model.named_parameters())
        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
            ]
        num_train_optimization_steps = len(train_dataloader) * args.num_train_epochs
        optimizer = OpenAIAdam(optimizer_grouped_parameters,
                               lr=args.learning_rate,
                               warmup=args.warmup_proportion,
                               max_grad_norm=args.max_grad_norm,
                               weight_decay=args.weight_decay,
                               t_total=num_train_optimization_steps)

    if args.do_train:
        nb_tr_steps, tr_loss, exp_average_loss = 0, 0, None
        model.train()
        for _ in trange(int(args.num_train_epochs), desc="Epoch"):
            tr_loss = 0
            nb_tr_steps = 0
            tqdm_bar = tqdm(train_dataloader, desc="Training")
            for step, batch in enumerate(tqdm_bar):
                batch = tuple(t.to(device) for t in batch)
                input_ids, mc_token_ids, lm_labels, mc_labels = batch
                losses = model(input_ids, mc_token_ids, lm_labels, mc_labels)
                loss = args.lm_coef * losses[0] + losses[1]
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
                tr_loss += loss.item()
                exp_average_loss = loss.item() if exp_average_loss is None else 0.7*exp_average_loss+0.3*loss.item()
                nb_tr_steps += 1
                tqdm_bar.desc = "Training loss: {:.2e} lr: {:.2e}".format(exp_average_loss, optimizer.get_lr()[0])

    # Save a trained model
    if args.do_train:
        # Save a trained model, configuration and tokenizer
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self

        # If we save using the predefined names, we can load using `from_pretrained`
        output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
        output_config_file = os.path.join(args.output_dir, CONFIG_NAME)

        torch.save(model_to_save.state_dict(), output_model_file)
        model_to_save.config.to_json_file(output_config_file)
        tokenizer.save_vocabul
Download .txt
gitextract_mjpetdbh/

├── LICENSE
├── MANIFEST.in
├── README.md
├── docker/
│   └── Dockerfile
├── examples/
│   ├── bertology.py
│   ├── extract_features.py
│   ├── lm_finetuning/
│   │   ├── README.md
│   │   ├── finetune_on_pregenerated.py
│   │   ├── pregenerate_training_data.py
│   │   └── simple_lm_finetuning.py
│   ├── run_classifier.py
│   ├── run_classifier_dataset_utils.py
│   ├── run_gpt2.py
│   ├── run_openai_gpt.py
│   ├── run_squad.py
│   ├── run_squad_dataset_utils.py
│   ├── run_swag.py
│   ├── run_transfo_xl.py
│   ├── sem_run_classifier.py
│   ├── tacred_run_classifier.py
│   ├── tacred_run_infer.py
│   ├── test.sh
│   └── train.sh
├── hubconf.py
├── hubconfs/
│   ├── bert_hubconf.py
│   ├── gpt2_hubconf.py
│   ├── gpt_hubconf.py
│   └── transformer_xl_hubconf.py
├── notebooks/
│   ├── Comparing-PT-and-TF-models.ipynb
│   ├── Comparing-TF-and-PT-models-MLM-NSP.ipynb
│   ├── Comparing-TF-and-PT-models-SQuAD.ipynb
│   └── Comparing-TF-and-PT-models.ipynb
├── pytorch_pretrained_bert/
│   ├── __init__.py
│   ├── __main__.py
│   ├── convert_gpt2_checkpoint_to_pytorch.py
│   ├── convert_openai_checkpoint_to_pytorch.py
│   ├── convert_pytorch_checkpoint_to_tf.py
│   ├── convert_tf_checkpoint_to_pytorch.py
│   ├── convert_transfo_xl_checkpoint_to_pytorch.py
│   ├── file_utils.py
│   ├── modeling.py
│   ├── modeling_gpt2.py
│   ├── modeling_openai.py
│   ├── modeling_transfo_xl.py
│   ├── modeling_transfo_xl_utilities.py
│   ├── optimization.py
│   ├── optimization_openai.py
│   ├── tokenization.py
│   ├── tokenization_gpt2.py
│   ├── tokenization_openai.py
│   └── tokenization_transfo_xl.py
├── requirements.txt
├── samples/
│   ├── input.txt
│   └── sample_text.txt
├── setup.py
└── tests/
    ├── conftest.py
    ├── modeling_gpt2_test.py
    ├── modeling_openai_test.py
    ├── modeling_test.py
    ├── modeling_transfo_xl_test.py
    ├── optimization_test.py
    ├── tokenization_gpt2_test.py
    ├── tokenization_openai_test.py
    ├── tokenization_test.py
    └── tokenization_transfo_xl_test.py
Download .txt
SYMBOL INDEX (925 symbols across 48 files)

FILE: examples/bertology.py
  function entropy (line 23) | def entropy(p):
  function print_1d_tensor (line 29) | def print_1d_tensor(tensor, prefix=""):
  function print_2d_tensor (line 36) | def print_2d_tensor(tensor):
  function compute_heads_importance (line 42) | def compute_heads_importance(args, model, eval_dataloader, compute_entro...
  function run_model (line 110) | def run_model():

FILE: examples/extract_features.py
  class InputExample (line 40) | class InputExample(object):
    method __init__ (line 42) | def __init__(self, unique_id, text_a, text_b):
  class InputFeatures (line 48) | class InputFeatures(object):
    method __init__ (line 51) | def __init__(self, unique_id, tokens, input_ids, input_mask, input_typ...
  function convert_examples_to_features (line 59) | def convert_examples_to_features(examples, seq_length, tokenizer):
  function _truncate_seq_pair (line 150) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  function read_examples (line 167) | def read_examples(input_file):
  function main (line 191) | def main():

FILE: examples/lm_finetuning/finetune_on_pregenerated.py
  function convert_example_to_features (line 27) | def convert_example_to_features(example, tokenizer, max_seq_length):
  class PregeneratedDataset (line 58) | class PregeneratedDataset(Dataset):
    method __init__ (line 59) | def __init__(self, training_path, epoch, tokenizer, num_data_epochs, r...
    method __len__ (line 113) | def __len__(self):
    method __getitem__ (line 116) | def __getitem__(self, item):
  function main (line 124) | def main():

FILE: examples/lm_finetuning/pregenerate_training_data.py
  class DocumentDatabase (line 14) | class DocumentDatabase:
    method __init__ (line 15) | def __init__(self, reduce_memory=False):
    method add_document (line 33) | def add_document(self, document):
    method _precalculate_doc_weights (line 43) | def _precalculate_doc_weights(self):
    method sample_doc (line 47) | def sample_doc(self, current_idx, sentence_weighted=True):
    method __len__ (line 66) | def __len__(self):
    method __getitem__ (line 69) | def __getitem__(self, item):
    method __enter__ (line 75) | def __enter__(self):
    method __exit__ (line 78) | def __exit__(self, exc_type, exc_val, traceback):
  function truncate_seq_pair (line 85) | def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens):
  function create_masked_lm_predictions (line 105) | def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions...
  function create_instances_from_document (line 170) | def create_instances_from_document(
  function create_training_file (line 268) | def create_training_file(docs, vocab_list, args, epoch_num):
  function main (line 290) | def main():

FILE: examples/lm_finetuning/simple_lm_finetuning.py
  class BERTDataset (line 43) | class BERTDataset(Dataset):
    method __init__ (line 44) | def __init__(self, corpus_path, tokenizer, seq_len, encoding="utf-8", ...
    method __len__ (line 109) | def __len__(self):
    method __getitem__ (line 113) | def __getitem__(self, item):
    method random_sent (line 142) | def random_sent(self, index):
    method get_corpus_line (line 160) | def get_corpus_line(self, item):
    method get_random_line (line 197) | def get_random_line(self):
    method get_next_line (line 220) | def get_next_line(self):
  class InputExample (line 235) | class InputExample(object):
    method __init__ (line 238) | def __init__(self, guid, tokens_a, tokens_b=None, is_next=None, lm_lab...
  class InputFeatures (line 257) | class InputFeatures(object):
    method __init__ (line 260) | def __init__(self, input_ids, input_mask, segment_ids, is_next, lm_lab...
  function random_word (line 268) | def random_word(tokens, tokenizer):
  function convert_example_to_features (line 307) | def convert_example_to_features(example, max_seq_length, tokenizer):
  function main (line 401) | def main():
  function _truncate_seq_pair (line 626) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  function accuracy (line 643) | def accuracy(out, labels):

FILE: examples/run_classifier.py
  class InputExample (line 51) | class InputExample(object):
    method __init__ (line 54) | def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=N...
  class InputFeatures (line 72) | class InputFeatures(object):
    method __init__ (line 75) | def __init__(self, input_ids, input_mask, segment_ids, label_id, entit...
  class DataProcessor (line 86) | class DataProcessor(object):
    method get_train_examples (line 89) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 93) | def get_dev_examples(self, data_dir):
    method get_labels (line 97) | def get_labels(self):
    method _read_tsv (line 102) | def _read_tsv(cls, input_file, quotechar=None):
  class MrpcProcessor (line 114) | class MrpcProcessor(DataProcessor):
    method get_train_examples (line 117) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 123) | def get_dev_examples(self, data_dir):
    method get_labels (line 128) | def get_labels(self):
    method _create_examples (line 132) | def _create_examples(self, lines, set_type):
  class SemProcessor (line 146) | class SemProcessor(DataProcessor):
    method get_train_examples (line 149) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 155) | def get_dev_examples(self, data_dir):
    method get_labels (line 160) | def get_labels(self):
    method _create_examples (line 164) | def _create_examples(self, lines, set_type):
  class MnliProcessor (line 179) | class MnliProcessor(DataProcessor):
    method get_train_examples (line 182) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 187) | def get_dev_examples(self, data_dir):
    method get_labels (line 193) | def get_labels(self):
    method _create_examples (line 197) | def _create_examples(self, lines, set_type):
  class MnliMismatchedProcessor (line 212) | class MnliMismatchedProcessor(MnliProcessor):
    method get_dev_examples (line 215) | def get_dev_examples(self, data_dir):
  class ColaProcessor (line 222) | class ColaProcessor(DataProcessor):
    method get_train_examples (line 225) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 230) | def get_dev_examples(self, data_dir):
    method get_labels (line 235) | def get_labels(self):
    method _create_examples (line 239) | def _create_examples(self, lines, set_type):
  class Sst2Processor (line 251) | class Sst2Processor(DataProcessor):
    method get_train_examples (line 254) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 259) | def get_dev_examples(self, data_dir):
    method get_labels (line 264) | def get_labels(self):
    method _create_examples (line 268) | def _create_examples(self, lines, set_type):
  class StsbProcessor (line 282) | class StsbProcessor(DataProcessor):
    method get_train_examples (line 285) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 290) | def get_dev_examples(self, data_dir):
    method get_labels (line 295) | def get_labels(self):
    method _create_examples (line 299) | def _create_examples(self, lines, set_type):
  class QqpProcessor (line 314) | class QqpProcessor(DataProcessor):
    method get_train_examples (line 317) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 322) | def get_dev_examples(self, data_dir):
    method get_labels (line 327) | def get_labels(self):
    method _create_examples (line 331) | def _create_examples(self, lines, set_type):
  class QnliProcessor (line 349) | class QnliProcessor(DataProcessor):
    method get_train_examples (line 352) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 357) | def get_dev_examples(self, data_dir):
    method get_labels (line 363) | def get_labels(self):
    method _create_examples (line 367) | def _create_examples(self, lines, set_type):
  class RteProcessor (line 382) | class RteProcessor(DataProcessor):
    method get_train_examples (line 385) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 390) | def get_dev_examples(self, data_dir):
    method get_labels (line 395) | def get_labels(self):
    method _create_examples (line 399) | def _create_examples(self, lines, set_type):
  class WnliProcessor (line 414) | class WnliProcessor(DataProcessor):
    method get_train_examples (line 417) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 422) | def get_dev_examples(self, data_dir):
    method get_labels (line 427) | def get_labels(self):
    method _create_examples (line 431) | def _create_examples(self, lines, set_type):
  function convert_examples_to_features (line 446) | def convert_examples_to_features(examples, label_list, max_seq_length,
  function _truncate_seq_pair (line 650) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  function simple_accuracy (line 667) | def simple_accuracy(preds, labels):
  function acc_and_f1 (line 671) | def acc_and_f1(preds, labels):
  function pearson_and_spearman (line 683) | def pearson_and_spearman(preds, labels):
  function compute_metrics (line 693) | def compute_metrics(task_name, preds, labels):
  function main (line 721) | def main():

FILE: examples/run_classifier_dataset_utils.py
  class InputExample (line 31) | class InputExample(object):
    method __init__ (line 34) | def __init__(self, guid, text_a, text_b=None, label=None):
  class InputFeatures (line 52) | class InputFeatures(object):
    method __init__ (line 55) | def __init__(self, input_ids, input_mask, segment_ids, label_id):
  class DataProcessor (line 62) | class DataProcessor(object):
    method get_train_examples (line 65) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 69) | def get_dev_examples(self, data_dir):
    method get_labels (line 73) | def get_labels(self):
    method _read_tsv (line 78) | def _read_tsv(cls, input_file, quotechar=None):
  class MrpcProcessor (line 90) | class MrpcProcessor(DataProcessor):
    method get_train_examples (line 93) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 99) | def get_dev_examples(self, data_dir):
    method get_labels (line 104) | def get_labels(self):
    method _create_examples (line 108) | def _create_examples(self, lines, set_type):
  class MnliProcessor (line 123) | class MnliProcessor(DataProcessor):
    method get_train_examples (line 126) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 131) | def get_dev_examples(self, data_dir):
    method get_labels (line 137) | def get_labels(self):
    method _create_examples (line 141) | def _create_examples(self, lines, set_type):
  class MnliMismatchedProcessor (line 156) | class MnliMismatchedProcessor(MnliProcessor):
    method get_dev_examples (line 159) | def get_dev_examples(self, data_dir):
  class ColaProcessor (line 166) | class ColaProcessor(DataProcessor):
    method get_train_examples (line 169) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 174) | def get_dev_examples(self, data_dir):
    method get_labels (line 179) | def get_labels(self):
    method _create_examples (line 183) | def _create_examples(self, lines, set_type):
  class Sst2Processor (line 195) | class Sst2Processor(DataProcessor):
    method get_train_examples (line 198) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 203) | def get_dev_examples(self, data_dir):
    method get_labels (line 208) | def get_labels(self):
    method _create_examples (line 212) | def _create_examples(self, lines, set_type):
  class StsbProcessor (line 226) | class StsbProcessor(DataProcessor):
    method get_train_examples (line 229) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 234) | def get_dev_examples(self, data_dir):
    method get_labels (line 239) | def get_labels(self):
    method _create_examples (line 243) | def _create_examples(self, lines, set_type):
  class QqpProcessor (line 258) | class QqpProcessor(DataProcessor):
    method get_train_examples (line 261) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 266) | def get_dev_examples(self, data_dir):
    method get_labels (line 271) | def get_labels(self):
    method _create_examples (line 275) | def _create_examples(self, lines, set_type):
  class QnliProcessor (line 293) | class QnliProcessor(DataProcessor):
    method get_train_examples (line 296) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 301) | def get_dev_examples(self, data_dir):
    method get_labels (line 307) | def get_labels(self):
    method _create_examples (line 311) | def _create_examples(self, lines, set_type):
  class RteProcessor (line 326) | class RteProcessor(DataProcessor):
    method get_train_examples (line 329) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 334) | def get_dev_examples(self, data_dir):
    method get_labels (line 339) | def get_labels(self):
    method _create_examples (line 343) | def _create_examples(self, lines, set_type):
  class WnliProcessor (line 358) | class WnliProcessor(DataProcessor):
    method get_train_examples (line 361) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 366) | def get_dev_examples(self, data_dir):
    method get_labels (line 371) | def get_labels(self):
    method _create_examples (line 375) | def _create_examples(self, lines, set_type):
  function convert_examples_to_features (line 390) | def convert_examples_to_features(examples, label_list, max_seq_length,
  function _truncate_seq_pair (line 482) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  function simple_accuracy (line 499) | def simple_accuracy(preds, labels):
  function acc_and_f1 (line 503) | def acc_and_f1(preds, labels):
  function pearson_and_spearman (line 513) | def pearson_and_spearman(preds, labels):
  function compute_metrics (line 523) | def compute_metrics(task_name, preds, labels):

FILE: examples/run_gpt2.py
  function top_k_logits (line 18) | def top_k_logits(logits, k):
  function sample_sequence (line 31) | def sample_sequence(model, length, start_token=None, batch_size=None, co...
  function run_model (line 54) | def run_model():

FILE: examples/run_openai_gpt.py
  function accuracy (line 52) | def accuracy(out, labels):
  function load_rocstories_dataset (line 56) | def load_rocstories_dataset(dataset_path):
  function pre_process_datasets (line 66) | def pre_process_datasets(encoded_datasets, input_len, cap_length, start_...
  function main (line 93) | def main():

FILE: examples/run_squad.py
  function main (line 51) | def main():

FILE: examples/run_squad_dataset_utils.py
  class SquadExample (line 31) | class SquadExample(object):
    method __init__ (line 37) | def __init__(self,
    method __str__ (line 53) | def __str__(self):
    method __repr__ (line 56) | def __repr__(self):
  class InputFeatures (line 71) | class InputFeatures(object):
    method __init__ (line 74) | def __init__(self,
  function read_squad_examples (line 101) | def read_squad_examples(input_file, is_training, version_2_with_negative):
  function convert_examples_to_features (line 179) | def convert_examples_to_features(examples, tokenizer, max_seq_length,
  function _improve_answer_span (line 342) | def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
  function _check_is_max_context (line 379) | def _check_is_max_context(doc_spans, cur_span_index, position):
  function write_predictions (line 420) | def write_predictions(all_examples, all_features, all_results, n_best_size,
  function get_final_text (line 612) | def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=...
  function _get_best_indexes (line 708) | def _get_best_indexes(logits, n_best_size):
  function _compute_softmax (line 720) | def _compute_softmax(scores):

FILE: examples/run_swag.py
  class SwagExample (line 46) | class SwagExample(object):
    method __init__ (line 48) | def __init__(self,
    method __str__ (line 68) | def __str__(self):
    method __repr__ (line 71) | def __repr__(self):
  class InputFeatures (line 88) | class InputFeatures(object):
    method __init__ (line 89) | def __init__(self,
  function read_swag_examples (line 107) | def read_swag_examples(input_file, is_training):
  function convert_examples_to_features (line 138) | def convert_examples_to_features(examples, tokenizer, max_seq_length,
  function _truncate_seq_pair (line 216) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  function accuracy (line 232) | def accuracy(out, labels):
  function select_field (line 236) | def select_field(features, field):
  function main (line 245) | def main():

FILE: examples/run_transfo_xl.py
  function main (line 38) | def main():

FILE: examples/sem_run_classifier.py
  class InputExample (line 51) | class InputExample(object):
    method __init__ (line 54) | def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=N...
  class InputFeatures (line 72) | class InputFeatures(object):
    method __init__ (line 75) | def __init__(self, input_ids, input_mask, segment_ids, label_id, entit...
  class DataProcessor (line 86) | class DataProcessor(object):
    method get_train_examples (line 89) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 93) | def get_dev_examples(self, data_dir):
    method get_labels (line 97) | def get_labels(self):
    method _read_tsv (line 102) | def _read_tsv(cls, input_file, quotechar=None):
  class MrpcProcessor (line 114) | class MrpcProcessor(DataProcessor):
    method get_train_examples (line 117) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 123) | def get_dev_examples(self, data_dir):
    method get_labels (line 128) | def get_labels(self):
    method _create_examples (line 132) | def _create_examples(self, lines, set_type):
  class SemProcessor (line 146) | class SemProcessor(DataProcessor):
    method get_train_examples (line 149) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 155) | def get_dev_examples(self, data_dir):
    method get_labels (line 160) | def get_labels(self):
    method _create_examples (line 164) | def _create_examples(self, lines, set_type):
  class MnliProcessor (line 179) | class MnliProcessor(DataProcessor):
    method get_train_examples (line 182) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 187) | def get_dev_examples(self, data_dir):
    method get_labels (line 193) | def get_labels(self):
    method _create_examples (line 197) | def _create_examples(self, lines, set_type):
  class MnliMismatchedProcessor (line 212) | class MnliMismatchedProcessor(MnliProcessor):
    method get_dev_examples (line 215) | def get_dev_examples(self, data_dir):
  class ColaProcessor (line 222) | class ColaProcessor(DataProcessor):
    method get_train_examples (line 225) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 230) | def get_dev_examples(self, data_dir):
    method get_labels (line 235) | def get_labels(self):
    method _create_examples (line 239) | def _create_examples(self, lines, set_type):
  class Sst2Processor (line 251) | class Sst2Processor(DataProcessor):
    method get_train_examples (line 254) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 259) | def get_dev_examples(self, data_dir):
    method get_labels (line 264) | def get_labels(self):
    method _create_examples (line 268) | def _create_examples(self, lines, set_type):
  class StsbProcessor (line 282) | class StsbProcessor(DataProcessor):
    method get_train_examples (line 285) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 290) | def get_dev_examples(self, data_dir):
    method get_labels (line 295) | def get_labels(self):
    method _create_examples (line 299) | def _create_examples(self, lines, set_type):
  class QqpProcessor (line 314) | class QqpProcessor(DataProcessor):
    method get_train_examples (line 317) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 322) | def get_dev_examples(self, data_dir):
    method get_labels (line 327) | def get_labels(self):
    method _create_examples (line 331) | def _create_examples(self, lines, set_type):
  class QnliProcessor (line 349) | class QnliProcessor(DataProcessor):
    method get_train_examples (line 352) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 357) | def get_dev_examples(self, data_dir):
    method get_labels (line 363) | def get_labels(self):
    method _create_examples (line 367) | def _create_examples(self, lines, set_type):
  class RteProcessor (line 382) | class RteProcessor(DataProcessor):
    method get_train_examples (line 385) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 390) | def get_dev_examples(self, data_dir):
    method get_labels (line 395) | def get_labels(self):
    method _create_examples (line 399) | def _create_examples(self, lines, set_type):
  class WnliProcessor (line 414) | class WnliProcessor(DataProcessor):
    method get_train_examples (line 417) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 422) | def get_dev_examples(self, data_dir):
    method get_labels (line 427) | def get_labels(self):
    method _create_examples (line 431) | def _create_examples(self, lines, set_type):
  function convert_examples_to_features (line 446) | def convert_examples_to_features(examples, label_list, max_seq_length,
  function _truncate_seq_pair (line 650) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  function simple_accuracy (line 667) | def simple_accuracy(preds, labels):
  function acc_and_f1 (line 671) | def acc_and_f1(preds, labels):
  function pearson_and_spearman (line 683) | def pearson_and_spearman(preds, labels):
  function compute_metrics (line 693) | def compute_metrics(task_name, preds, labels):
  function main (line 721) | def main():

FILE: examples/tacred_run_classifier.py
  class InputExample (line 51) | class InputExample(object):
    method __init__ (line 54) | def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=N...
  class InputFeatures (line 72) | class InputFeatures(object):
    method __init__ (line 75) | def __init__(self, input_ids, input_mask, segment_ids, label_id, entit...
  class DataProcessor (line 86) | class DataProcessor(object):
    method get_train_examples (line 89) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 93) | def get_dev_examples(self, data_dir):
    method get_labels (line 97) | def get_labels(self):
    method _read_tsv (line 102) | def _read_tsv(cls, input_file, quotechar=None):
  class MrpcProcessor (line 114) | class MrpcProcessor(DataProcessor):
    method get_train_examples (line 117) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 123) | def get_dev_examples(self, data_dir):
    method get_labels (line 128) | def get_labels(self):
    method _create_examples (line 132) | def _create_examples(self, lines, set_type):
  class SemProcessor (line 146) | class SemProcessor(DataProcessor):
    method get_train_examples (line 149) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 155) | def get_dev_examples(self, data_dir):
    method get_labels (line 160) | def get_labels(self):
    method _create_examples (line 164) | def _create_examples(self, lines, set_type):
  class TacredProcessor (line 177) | class TacredProcessor(DataProcessor):
    method get_train_examples (line 180) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 186) | def get_dev_examples(self, data_dir):
    method get_test_examples (line 191) | def get_test_examples(self, data_dir):
    method get_labels (line 196) | def get_labels(self):
    method _create_examples (line 199) | def _create_examples(self, lines, set_type):
  class MnliProcessor (line 215) | class MnliProcessor(DataProcessor):
    method get_train_examples (line 218) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 223) | def get_dev_examples(self, data_dir):
    method get_labels (line 229) | def get_labels(self):
    method _create_examples (line 233) | def _create_examples(self, lines, set_type):
  class MnliMismatchedProcessor (line 248) | class MnliMismatchedProcessor(MnliProcessor):
    method get_dev_examples (line 251) | def get_dev_examples(self, data_dir):
  class ColaProcessor (line 258) | class ColaProcessor(DataProcessor):
    method get_train_examples (line 261) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 266) | def get_dev_examples(self, data_dir):
    method get_labels (line 271) | def get_labels(self):
    method _create_examples (line 275) | def _create_examples(self, lines, set_type):
  class Sst2Processor (line 287) | class Sst2Processor(DataProcessor):
    method get_train_examples (line 290) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 295) | def get_dev_examples(self, data_dir):
    method get_labels (line 300) | def get_labels(self):
    method _create_examples (line 304) | def _create_examples(self, lines, set_type):
  class StsbProcessor (line 318) | class StsbProcessor(DataProcessor):
    method get_train_examples (line 321) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 326) | def get_dev_examples(self, data_dir):
    method get_labels (line 331) | def get_labels(self):
    method _create_examples (line 335) | def _create_examples(self, lines, set_type):
  class QqpProcessor (line 350) | class QqpProcessor(DataProcessor):
    method get_train_examples (line 353) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 358) | def get_dev_examples(self, data_dir):
    method get_labels (line 363) | def get_labels(self):
    method _create_examples (line 367) | def _create_examples(self, lines, set_type):
  class QnliProcessor (line 385) | class QnliProcessor(DataProcessor):
    method get_train_examples (line 388) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 393) | def get_dev_examples(self, data_dir):
    method get_labels (line 399) | def get_labels(self):
    method _create_examples (line 403) | def _create_examples(self, lines, set_type):
  class RteProcessor (line 418) | class RteProcessor(DataProcessor):
    method get_train_examples (line 421) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 426) | def get_dev_examples(self, data_dir):
    method get_labels (line 431) | def get_labels(self):
    method _create_examples (line 435) | def _create_examples(self, lines, set_type):
  class WnliProcessor (line 450) | class WnliProcessor(DataProcessor):
    method get_train_examples (line 453) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 458) | def get_dev_examples(self, data_dir):
    method get_labels (line 463) | def get_labels(self):
    method _create_examples (line 467) | def _create_examples(self, lines, set_type):
  function convert_examples_to_features (line 481) | def convert_examples_to_features(examples, label_list, max_seq_length,
  function _truncate_seq_pair (line 686) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  function simple_accuracy (line 703) | def simple_accuracy(preds, labels):
  function acc_and_f1 (line 707) | def acc_and_f1(preds, labels):
  function pearson_and_spearman (line 721) | def pearson_and_spearman(preds, labels):
  function compute_metrics (line 731) | def compute_metrics(task_name, preds, labels):
  function main (line 761) | def main():

FILE: examples/tacred_run_infer.py
  class InputExample (line 39) | class InputExample(object):
    method __init__ (line 42) | def __init__(self, guid, text_a, text_b=None, label=None, entity_pos=N...
  class InputFeatures (line 60) | class InputFeatures(object):
    method __init__ (line 63) | def __init__(self,input_ids, input_mask, segment_ids, label_id, entity...
  class DataProcessor (line 74) | class DataProcessor(object):
    method get_train_examples (line 77) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 81) | def get_dev_examples(self, data_dir):
    method get_labels (line 85) | def get_labels(self):
    method _read_tsv (line 90) | def _read_tsv(cls, input_file, quotechar=None):
  class TacredProcessor (line 101) | class TacredProcessor(DataProcessor):
    method get_train_examples (line 104) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 110) | def get_dev_examples(self, data_dir):
    method get_test_examples (line 115) | def get_test_examples(self, data_dir):
    method get_labels (line 120) | def get_labels(self):
    method _create_examples (line 123) | def _create_examples(self, lines, set_type):
  class _TacredProcessor (line 138) | class _TacredProcessor(DataProcessor):
    method get_test_examples (line 141) | def get_test_examples(self, lines):
    method get_labels (line 145) | def get_labels(self):
    method _create_examples (line 149) | def _create_examples(self, lines, set_type):
  function convert_examples_to_features (line 164) | def convert_examples_to_features(examples, label_list, max_seq_length,
  function _truncate_seq_pair (line 359) | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  function load_model (line 375) | def load_model():
  function get_helper_model (line 453) | def get_helper_model(spacy_used=False):
  function predict (line 466) | def predict():

FILE: hubconfs/bert_hubconf.py
  function _append_from_pretrained_docstring (line 48) | def _append_from_pretrained_docstring(docstr):
  function bertTokenizer (line 55) | def bertTokenizer(*args, **kwargs):
  function bertModel (line 100) | def bertModel(*args, **kwargs):
  function bertForNextSentencePrediction (line 129) | def bertForNextSentencePrediction(*args, **kwargs):
  function bertForPreTraining (line 158) | def bertForPreTraining(*args, **kwargs):
  function bertForMaskedLM (line 184) | def bertForMaskedLM(*args, **kwargs):
  function bertForSequenceClassification (line 217) | def bertForSequenceClassification(*args, **kwargs):
  function bertForMultipleChoice (line 257) | def bertForMultipleChoice(*args, **kwargs):
  function bertForQuestionAnswering (line 292) | def bertForQuestionAnswering(*args, **kwargs):
  function bertForTokenClassification (line 326) | def bertForTokenClassification(*args, **kwargs):

FILE: hubconfs/gpt2_hubconf.py
  function _append_from_pretrained_docstring (line 28) | def _append_from_pretrained_docstring(docstr):
  function gpt2Tokenizer (line 35) | def gpt2Tokenizer(*args, **kwargs):
  function gpt2Model (line 66) | def gpt2Model(*args, **kwargs):
  function gpt2LMHeadModel (line 100) | def gpt2LMHeadModel(*args, **kwargs):
  function gpt2DoubleHeadsModel (line 138) | def gpt2DoubleHeadsModel(*args, **kwargs):

FILE: hubconfs/gpt_hubconf.py
  function _append_from_pretrained_docstring (line 49) | def _append_from_pretrained_docstring(docstr):
  function openAIGPTTokenizer (line 56) | def openAIGPTTokenizer(*args, **kwargs):
  function openAIGPTModel (line 92) | def openAIGPTModel(*args, **kwargs):
  function openAIGPTLMHeadModel (line 122) | def openAIGPTLMHeadModel(*args, **kwargs):
  function openAIGPTDoubleHeadsModel (line 156) | def openAIGPTDoubleHeadsModel(*args, **kwargs):

FILE: hubconfs/transformer_xl_hubconf.py
  function _append_from_pretrained_docstring (line 31) | def _append_from_pretrained_docstring(docstr):
  function transformerXLTokenizer (line 38) | def transformerXLTokenizer(*args, **kwargs):
  function transformerXLModel (line 60) | def transformerXLModel(*args, **kwargs):
  function transformerXLLMHeadModel (line 94) | def transformerXLLMHeadModel(*args, **kwargs):

FILE: pytorch_pretrained_bert/__main__.py
  function main (line 2) | def main():

FILE: pytorch_pretrained_bert/convert_gpt2_checkpoint_to_pytorch.py
  function convert_gpt2_checkpoint_to_pytorch (line 30) | def convert_gpt2_checkpoint_to_pytorch(gpt2_checkpoint_path, gpt2_config...

FILE: pytorch_pretrained_bert/convert_openai_checkpoint_to_pytorch.py
  function convert_openai_checkpoint_to_pytorch (line 30) | def convert_openai_checkpoint_to_pytorch(openai_checkpoint_folder_path, ...

FILE: pytorch_pretrained_bert/convert_pytorch_checkpoint_to_tf.py
  function convert_pytorch_checkpoint_to_tf (line 26) | def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, mode...
  function main (line 95) | def main(raw_args=None):

FILE: pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py
  function convert_tf_checkpoint_to_pytorch (line 30) | def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_fil...

FILE: pytorch_pretrained_bert/convert_transfo_xl_checkpoint_to_pytorch.py
  function convert_transfo_xl_checkpoint_to_pytorch (line 47) | def convert_transfo_xl_checkpoint_to_pytorch(tf_checkpoint_path,

FILE: pytorch_pretrained_bert/file_utils.py
  function url_to_filename (line 53) | def url_to_filename(url, etag=None):
  function filename_to_url (line 71) | def filename_to_url(filename, cache_dir=None):
  function cached_path (line 97) | def cached_path(url_or_filename, cache_dir=None):
  function split_s3_path (line 127) | def split_s3_path(url):
  function s3_request (line 140) | def s3_request(func):
  function s3_etag (line 160) | def s3_etag(url):
  function s3_get (line 169) | def s3_get(url, temp_file):
  function http_get (line 176) | def http_get(url, temp_file):
  function get_from_cache (line 188) | def get_from_cache(url, cache_dir=None):
  function read_set_from_file (line 264) | def read_set_from_file(filename):
  function get_file_extension (line 276) | def get_file_extension(path, dot=True, lower=True):

FILE: pytorch_pretrained_bert/modeling.py
  function load_tf_weights_in_bert (line 51) | def load_tf_weights_in_bert(model, tf_checkpoint_path):
  function gelu (line 118) | def gelu(x):
  function swish (line 127) | def swish(x):
  class BertConfig (line 134) | class BertConfig(object):
    method __init__ (line 137) | def __init__(self,
    method from_dict (line 199) | def from_dict(cls, json_object):
    method from_json_file (line 207) | def from_json_file(cls, json_file):
    method __repr__ (line 213) | def __repr__(self):
    method to_dict (line 216) | def to_dict(self):
    method to_json_string (line 221) | def to_json_string(self):
    method to_json_file (line 225) | def to_json_file(self, json_file_path):
  class BertLayerNorm (line 234) | class BertLayerNorm(nn.Module):
    method __init__ (line 235) | def __init__(self, hidden_size, eps=1e-12):
    method forward (line 243) | def forward(self, x):
  class BertEmbeddings (line 249) | class BertEmbeddings(nn.Module):
    method __init__ (line 252) | def __init__(self, config):
    method forward (line 263) | def forward(self, input_ids, entity_pos_seg=None, entity_span1_pos=Non...
  class BertSelfAttention (line 327) | class BertSelfAttention(nn.Module):
    method __init__ (line 328) | def __init__(self, config):
    method transpose_for_scores (line 344) | def transpose_for_scores(self, x):
    method forward (line 349) | def forward(self, hidden_states, attention_mask):
  class BertSelfOutput (line 378) | class BertSelfOutput(nn.Module):
    method __init__ (line 379) | def __init__(self, config):
    method forward (line 385) | def forward(self, hidden_states, input_tensor):
  class BertAttention (line 392) | class BertAttention(nn.Module):
    method __init__ (line 393) | def __init__(self, config):
    method forward (line 398) | def forward(self, input_tensor, attention_mask):
  class BertIntermediate (line 404) | class BertIntermediate(nn.Module):
    method __init__ (line 405) | def __init__(self, config):
    method forward (line 413) | def forward(self, hidden_states):
  class BertOutput (line 419) | class BertOutput(nn.Module):
    method __init__ (line 420) | def __init__(self, config):
    method forward (line 426) | def forward(self, hidden_states, input_tensor):
  class BertLayer (line 433) | class BertLayer(nn.Module):
    method __init__ (line 434) | def __init__(self, config):
    method forward (line 440) | def forward(self, hidden_states, attention_mask):
  class BertEncoder (line 447) | class BertEncoder(nn.Module):
    method __init__ (line 448) | def __init__(self, config):
    method forward (line 453) | def forward(self, hidden_states, attention_mask, output_all_encoded_la...
  class BertPooler (line 464) | class BertPooler(nn.Module):
    method __init__ (line 465) | def __init__(self, config):
    method forward (line 470) | def forward(self, hidden_states):
  class BertPredictionHeadTransform (line 479) | class BertPredictionHeadTransform(nn.Module):
    method __init__ (line 480) | def __init__(self, config):
    method forward (line 489) | def forward(self, hidden_states):
  class BertLMPredictionHead (line 496) | class BertLMPredictionHead(nn.Module):
    method __init__ (line 497) | def __init__(self, config, bert_model_embedding_weights):
    method forward (line 509) | def forward(self, hidden_states):
  class BertOnlyMLMHead (line 515) | class BertOnlyMLMHead(nn.Module):
    method __init__ (line 516) | def __init__(self, config, bert_model_embedding_weights):
    method forward (line 520) | def forward(self, sequence_output):
  class BertOnlyNSPHead (line 525) | class BertOnlyNSPHead(nn.Module):
    method __init__ (line 526) | def __init__(self, config):
    method forward (line 530) | def forward(self, pooled_output):
  class BertPreTrainingHeads (line 535) | class BertPreTrainingHeads(nn.Module):
    method __init__ (line 536) | def __init__(self, config, bert_model_embedding_weights):
    method forward (line 541) | def forward(self, sequence_output, pooled_output):
  class BertPreTrainedModel (line 547) | class BertPreTrainedModel(nn.Module):
    method __init__ (line 551) | def __init__(self, config, *inputs, **kwargs):
    method init_bert_weights (line 562) | def init_bert_weights(self, module):
    method from_pretrained (line 576) | def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwa...
  class BertModel (line 708) | class BertModel(BertPreTrainedModel):
    method __init__ (line 752) | def __init__(self, config):
    method forward (line 759) | def forward(self, input_ids, entity_seg_pos = None, entity_span1_pos=N...
  class BertForPreTraining (line 795) | class BertForPreTraining(BertPreTrainedModel):
    method __init__ (line 845) | def __init__(self, config):
    method forward (line 851) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
  class BertForMaskedLM (line 866) | class BertForMaskedLM(BertPreTrainedModel):
    method __init__ (line 908) | def __init__(self, config):
    method forward (line 914) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
  class BertForNextSentencePrediction (line 927) | class BertForNextSentencePrediction(BertPreTrainedModel):
    method __init__ (line 970) | def __init__(self, config):
    method forward (line 976) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
  class BertForSequenceClassification (line 989) | class BertForSequenceClassification(BertPreTrainedModel):
    method __init__ (line 1034) | def __init__(self, config, num_labels):
    method forward (line 1054) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
  class BertForMultipleChoice (line 1137) | class BertForMultipleChoice(BertPreTrainedModel):
    method __init__ (line 1181) | def __init__(self, config, num_choices):
    method forward (line 1189) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
  class BertForTokenClassification (line 1206) | class BertForTokenClassification(BertPreTrainedModel):
    method __init__ (line 1251) | def __init__(self, config, num_labels):
    method forward (line 1259) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...
  class BertForQuestionAnswering (line 1279) | class BertForQuestionAnswering(BertPreTrainedModel):
    method __init__ (line 1326) | def __init__(self, config):
    method forward (line 1334) | def forward(self, input_ids, token_type_ids=None, attention_mask=None,...

FILE: pytorch_pretrained_bert/modeling_gpt2.py
  function prune_conv1d_layer (line 44) | def prune_conv1d_layer(layer, index, dim=1):
  function load_tf_weights_in_gpt2 (line 68) | def load_tf_weights_in_gpt2(model, gpt2_checkpoint_path):
  function gelu (line 122) | def gelu(x):
  class GPT2Config (line 126) | class GPT2Config(object):
    method __init__ (line 130) | def __init__(
    method total_tokens_embeddings (line 194) | def total_tokens_embeddings(self):
    method from_dict (line 198) | def from_dict(cls, json_object):
    method from_json_file (line 206) | def from_json_file(cls, json_file):
    method __repr__ (line 212) | def __repr__(self):
    method to_dict (line 215) | def to_dict(self):
    method to_json_string (line 220) | def to_json_string(self):
    method to_json_file (line 224) | def to_json_file(self, json_file_path):
  class Conv1D (line 230) | class Conv1D(nn.Module):
    method __init__ (line 231) | def __init__(self, nf, nx):
    method forward (line 239) | def forward(self, x):
  class Attention (line 246) | class Attention(nn.Module):
    method __init__ (line 247) | def __init__(self, nx, n_ctx, config, scale=False, output_attentions=F...
    method prune_heads (line 266) | def prune_heads(self, heads):
    method _attn (line 282) | def _attn(self, q, k, v, head_mask=None):
    method merge_heads (line 301) | def merge_heads(self, x):
    method split_heads (line 306) | def split_heads(self, x, k=False):
    method forward (line 314) | def forward(self, x, layer_past=None, head_mask=None):
  class MLP (line 341) | class MLP(nn.Module):
    method __init__ (line 342) | def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)
    method forward (line 350) | def forward(self, x):
  class Block (line 356) | class Block(nn.Module):
    method __init__ (line 357) | def __init__(self, n_ctx, config, scale=False, output_attentions=False...
    method forward (line 366) | def forward(self, x, layer_past=None, head_mask=None):
  class GPT2LMHead (line 380) | class GPT2LMHead(nn.Module):
    method __init__ (line 383) | def __init__(self, model_embeddings_weights, config):
    method set_embeddings_weights (line 392) | def set_embeddings_weights(self, model_embeddings_weights, predict_spe...
    method forward (line 396) | def forward(self, hidden_state):
  class GPT2MultipleChoiceHead (line 403) | class GPT2MultipleChoiceHead(nn.Module):
    method __init__ (line 406) | def __init__(self, config):
    method forward (line 415) | def forward(self, hidden_states, mc_token_ids):
  class GPT2PreTrainedModel (line 429) | class GPT2PreTrainedModel(nn.Module):
    method __init__ (line 434) | def __init__(self, config, *inputs, **kwargs):
    method init_weights (line 446) | def init_weights(self, module):
    method from_pretrained (line 460) | def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwa...
  class GPT2Model (line 607) | class GPT2Model(GPT2PreTrainedModel):
    method __init__ (line 668) | def __init__(self, config, output_attentions=False, keep_multihead_out...
    method set_num_special_tokens (line 681) | def set_num_special_tokens(self, num_special_tokens):
    method prune_heads (line 695) | def prune_heads(self, heads_to_prune):
    method get_multihead_outputs (line 702) | def get_multihead_outputs(self):
    method forward (line 708) | def forward(self, input_ids, position_ids=None, token_type_ids=None, p...
  class GPT2LMHeadModel (line 768) | class GPT2LMHeadModel(GPT2PreTrainedModel):
    method __init__ (line 817) | def __init__(self, config, output_attentions=False, keep_multihead_out...
    method set_num_special_tokens (line 824) | def set_num_special_tokens(self, num_special_tokens, predict_special_t...
    method forward (line 832) | def forward(self, input_ids, position_ids=None, token_type_ids=None, l...
  class GPT2DoubleHeadsModel (line 855) | class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
    method __init__ (line 909) | def __init__(self, config, output_attentions=False, keep_multihead_out...
    method set_num_special_tokens (line 917) | def set_num_special_tokens(self, num_special_tokens, predict_special_t...
    method forward (line 925) | def forward(self, input_ids, mc_token_ids, lm_labels=None, mc_labels=N...

FILE: pytorch_pretrained_bert/modeling_openai.py
  function load_tf_weights_in_openai_gpt (line 44) | def load_tf_weights_in_openai_gpt(model, openai_checkpoint_folder_path):
  function gelu (line 114) | def gelu(x):
  function swish (line 118) | def swish(x):
  class OpenAIGPTConfig (line 125) | class OpenAIGPTConfig(object):
    method __init__ (line 129) | def __init__(
    method total_tokens_embeddings (line 197) | def total_tokens_embeddings(self):
    method from_dict (line 201) | def from_dict(cls, json_object):
    method from_json_file (line 209) | def from_json_file(cls, json_file):
    method __repr__ (line 215) | def __repr__(self):
    method to_dict (line 218) | def to_dict(self):
    method to_json_string (line 223) | def to_json_string(self):
    method to_json_file (line 227) | def to_json_file(self, json_file_path):
  class Conv1D (line 233) | class Conv1D(nn.Module):
    method __init__ (line 234) | def __init__(self, nf, rf, nx):
    method forward (line 246) | def forward(self, x):
  class Attention (line 256) | class Attention(nn.Module):
    method __init__ (line 257) | def __init__(self, nx, n_ctx, config, scale=False, output_attentions=F...
    method prune_heads (line 276) | def prune_heads(self, heads):
    method _attn (line 292) | def _attn(self, q, k, v, head_mask=None):
    method merge_heads (line 312) | def merge_heads(self, x):
    method split_heads (line 317) | def split_heads(self, x, k=False):
    method forward (line 325) | def forward(self, x, head_mask=None):
  class MLP (line 347) | class MLP(nn.Module):
    method __init__ (line 348) | def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)
    method forward (line 356) | def forward(self, x):
  class Block (line 362) | class Block(nn.Module):
    method __init__ (line 363) | def __init__(self, n_ctx, config, scale=False, output_attentions=False...
    method forward (line 372) | def forward(self, x, head_mask=None):
  class OpenAIGPTLMHead (line 384) | class OpenAIGPTLMHead(nn.Module):
    method __init__ (line 387) | def __init__(self, model_embeddings_weights, config):
    method set_embeddings_weights (line 396) | def set_embeddings_weights(self, model_embeddings_weights, predict_spe...
    method forward (line 401) | def forward(self, hidden_state):
  class OpenAIGPTMultipleChoiceHead (line 408) | class OpenAIGPTMultipleChoiceHead(nn.Module):
    method __init__ (line 411) | def __init__(self, config):
    method forward (line 420) | def forward(self, hidden_states, mc_token_ids):
  class OpenAIGPTPreTrainedModel (line 434) | class OpenAIGPTPreTrainedModel(nn.Module):
    method __init__ (line 439) | def __init__(self, config, *inputs, **kwargs):
    method init_weights (line 451) | def init_weights(self, module):
    method from_pretrained (line 465) | def from_pretrained(cls, pretrained_model_name_or_path, num_special_to...
  class OpenAIGPTModel (line 610) | class OpenAIGPTModel(OpenAIGPTPreTrainedModel):
    method __init__ (line 666) | def __init__(self, config, output_attentions=False, keep_multihead_out...
    method set_num_special_tokens (line 678) | def set_num_special_tokens(self, num_special_tokens):
    method prune_heads (line 692) | def prune_heads(self, heads_to_prune):
    method get_multihead_outputs (line 699) | def get_multihead_outputs(self):
    method forward (line 705) | def forward(self, input_ids, position_ids=None, token_type_ids=None, h...
  class OpenAIGPTLMHeadModel (line 760) | class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
    method __init__ (line 821) | def __init__(self, config, output_attentions=False, keep_multihead_out...
    method set_num_special_tokens (line 828) | def set_num_special_tokens(self, num_special_tokens, predict_special_t...
    method forward (line 836) | def forward(self, input_ids, position_ids=None, token_type_ids=None, l...
  class OpenAIGPTDoubleHeadsModel (line 857) | class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
    method __init__ (line 923) | def __init__(self, config, output_attentions=False, keep_multihead_out...
    method set_num_special_tokens (line 931) | def set_num_special_tokens(self, num_special_tokens, predict_special_t...
    method forward (line 939) | def forward(self, input_ids, mc_token_ids, lm_labels=None, mc_labels=N...

FILE: pytorch_pretrained_bert/modeling_transfo_xl.py
  function build_tf_to_pytorch_map (line 53) | def build_tf_to_pytorch_map(model, config):
  function load_tf_weights_in_transfo_xl (line 125) | def load_tf_weights_in_transfo_xl(model, config, tf_path):
  class TransfoXLConfig (line 181) | class TransfoXLConfig(object):
    method __init__ (line 184) | def __init__(self,
    method from_dict (line 289) | def from_dict(cls, json_object):
    method from_json_file (line 297) | def from_json_file(cls, json_file):
    method __repr__ (line 303) | def __repr__(self):
    method to_dict (line 306) | def to_dict(self):
    method to_json_string (line 311) | def to_json_string(self):
    method to_json_file (line 315) | def to_json_file(self, json_file_path):
  class PositionalEmbedding (line 321) | class PositionalEmbedding(nn.Module):
    method __init__ (line 322) | def __init__(self, demb):
    method forward (line 330) | def forward(self, pos_seq, bsz=None):
  class PositionwiseFF (line 340) | class PositionwiseFF(nn.Module):
    method __init__ (line 341) | def __init__(self, d_model, d_inner, dropout, pre_lnorm=False):
    method forward (line 359) | def forward(self, inp):
  class MultiHeadAttn (line 375) | class MultiHeadAttn(nn.Module):
    method __init__ (line 376) | def __init__(self, n_head, d_model, d_head, dropout, dropatt=0,
    method forward (line 405) | def forward(self, h, attn_mask=None, mems=None):
  class RelMultiHeadAttn (line 456) | class RelMultiHeadAttn(nn.Module):
    method __init__ (line 457) | def __init__(self, n_head, d_model, d_head, dropout, dropatt=0,
    method _parallelogram_mask (line 486) | def _parallelogram_mask(self, h, w, left=False):
    method _shift (line 497) | def _shift(self, x, qlen, klen, mask, left=False):
    method _rel_shift (line 515) | def _rel_shift(self, x, zero_triu=False):
    method forward (line 531) | def forward(self, w, r, attn_mask=None, mems=None):
  class RelPartialLearnableMultiHeadAttn (line 534) | class RelPartialLearnableMultiHeadAttn(RelMultiHeadAttn):
    method __init__ (line 535) | def __init__(self, *args, **kwargs):
    method forward (line 540) | def forward(self, w, r, attn_mask=None, mems=None):
  class RelLearnableMultiHeadAttn (line 615) | class RelLearnableMultiHeadAttn(RelMultiHeadAttn):
    method __init__ (line 616) | def __init__(self, *args, **kwargs):
    method forward (line 619) | def forward(self, w, r_emb, r_w_bias, r_bias, attn_mask=None, mems=None):
  class DecoderLayer (line 700) | class DecoderLayer(nn.Module):
    method __init__ (line 701) | def __init__(self, n_head, d_model, d_head, d_inner, dropout, **kwargs):
    method forward (line 708) | def forward(self, dec_inp, dec_attn_mask=None, mems=None):
  class RelLearnableDecoderLayer (line 716) | class RelLearnableDecoderLayer(nn.Module):
    method __init__ (line 717) | def __init__(self, n_head, d_model, d_head, d_inner, dropout,
    method forward (line 726) | def forward(self, dec_inp, r_emb, r_w_bias, r_bias, dec_attn_mask=None...
  class RelPartialLearnableDecoderLayer (line 735) | class RelPartialLearnableDecoderLayer(nn.Module):
    method __init__ (line 736) | def __init__(self, n_head, d_model, d_head, d_inner, dropout,
    method forward (line 745) | def forward(self, dec_inp, r, dec_attn_mask=None, mems=None):
  class AdaptiveEmbedding (line 755) | class AdaptiveEmbedding(nn.Module):
    method __init__ (line 756) | def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1,
    method forward (line 786) | def forward(self, inp):
  class TransfoXLPreTrainedModel (line 819) | class TransfoXLPreTrainedModel(nn.Module):
    method __init__ (line 823) | def __init__(self, config, *inputs, **kwargs):
    method init_weight (line 834) | def init_weight(self, weight):
    method init_bias (line 840) | def init_bias(self, bias):
    method init_weights (line 843) | def init_weights(self, m):
    method set_num_special_tokens (line 884) | def set_num_special_tokens(self, num_special_tokens):
    method from_pretrained (line 888) | def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwa...
  class TransfoXLModel (line 1012) | class TransfoXLModel(TransfoXLPreTrainedModel):
    method __init__ (line 1052) | def __init__(self, config):
    method backward_compatible (line 1127) | def backward_compatible(self):
    method reset_length (line 1131) | def reset_length(self, tgt_len, ext_len, mem_len):
    method init_mems (line 1136) | def init_mems(self, data):
    method _update_mems (line 1149) | def _update_mems(self, hids, mems, qlen, mlen):
    method _forward (line 1172) | def _forward(self, dec_inp, mems=None):
    method forward (line 1262) | def forward(self, input_ids, mems=None):
  class TransfoXLLMHeadModel (line 1289) | class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
    method __init__ (line 1339) | def __init__(self, config):
    method tie_weights (line 1354) | def tie_weights(self):
    method reset_length (line 1372) | def reset_length(self, tgt_len, ext_len, mem_len):
    method init_mems (line 1375) | def init_mems(self, data):
    method forward (line 1378) | def forward(self, input_ids, target=None, mems=None):

FILE: pytorch_pretrained_bert/modeling_transfo_xl_utilities.py
  class ProjectedAdaptiveLogSoftmax (line 31) | class ProjectedAdaptiveLogSoftmax(nn.Module):
    method __init__ (line 32) | def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1,
    method _compute_logit (line 78) | def _compute_logit(self, hidden, weight, bias, proj):
    method forward (line 92) | def forward(self, hidden, target=None, keep_order=False):
    method log_prob (line 198) | def log_prob(self, hidden):
  class LogUniformSampler (line 260) | class LogUniformSampler(object):
    method __init__ (line 261) | def __init__(self, range_max, n_sample):
    method sample (line 281) | def sample(self, labels):
  function sample_logits (line 302) | def sample_logits(embedding, bias, labels, inputs, sampler):

FILE: pytorch_pretrained_bert/optimization.py
  class _LRSchedule (line 35) | class _LRSchedule(ABC):
    method __init__ (line 38) | def __init__(self, warmup=0.002, t_total=-1, **kw):
    method get_lr (line 53) | def get_lr(self, step, nowarn=False):
    method get_lr_ (line 73) | def get_lr_(self, progress):
  class ConstantLR (line 81) | class ConstantLR(_LRSchedule):
    method get_lr_ (line 82) | def get_lr_(self, progress):
  class WarmupCosineSchedule (line 86) | class WarmupCosineSchedule(_LRSchedule):
    method __init__ (line 93) | def __init__(self, warmup=0.002, t_total=-1, cycles=.5, **kw):
    method get_lr_ (line 103) | def get_lr_(self, progress):
  class WarmupCosineWithHardRestartsSchedule (line 111) | class WarmupCosineWithHardRestartsSchedule(WarmupCosineSchedule):
    method __init__ (line 117) | def __init__(self, warmup=0.002, t_total=-1, cycles=1., **kw):
    method get_lr_ (line 121) | def get_lr_(self, progress):
  class WarmupCosineWithWarmupRestartsSchedule (line 130) | class WarmupCosineWithWarmupRestartsSchedule(WarmupCosineWithHardRestart...
    method __init__ (line 136) | def __init__(self, warmup=0.002, t_total=-1, cycles=1., **kw):
    method get_lr_ (line 141) | def get_lr_(self, progress):
  class WarmupConstantSchedule (line 151) | class WarmupConstantSchedule(_LRSchedule):
    method get_lr_ (line 156) | def get_lr_(self, progress):
  class WarmupLinearSchedule (line 162) | class WarmupLinearSchedule(_LRSchedule):
    method get_lr_ (line 168) | def get_lr_(self, progress):
  class BertAdam (line 183) | class BertAdam(Optimizer):
    method __init__ (line 199) | def __init__(self, params, lr=required, warmup=-1, t_total=-1, schedul...
    method get_lr (line 224) | def get_lr(self):
    method step (line 236) | def step(self, closure=None):

FILE: pytorch_pretrained_bert/optimization_openai.py
  class OpenAIAdam (line 29) | class OpenAIAdam(Optimizer):
    method __init__ (line 32) | def __init__(self, params, lr=required, schedule='warmup_linear', warm...
    method get_lr (line 58) | def get_lr(self):
    method step (line 70) | def step(self, closure=None):

FILE: pytorch_pretrained_bert/tokenization.py
  function load_vocab (line 50) | def load_vocab(vocab_file):
  function whitespace_tokenize (line 65) | def whitespace_tokenize(text):
  class BertTokenizer (line 74) | class BertTokenizer(object):
    method __init__ (line 77) | def __init__(self, vocab_file, do_lower_case=True, max_len=None, do_ba...
    method tokenize (line 107) | def tokenize(self, text, entity_pos=None):
    method convert_tokens_to_ids (line 141) | def convert_tokens_to_ids(self, tokens):
    method convert_ids_to_tokens (line 154) | def convert_ids_to_tokens(self, ids):
    method save_vocabulary (line 161) | def save_vocabulary(self, vocab_path):
    method from_pretrained (line 177) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
  class BasicTokenizer (line 225) | class BasicTokenizer(object):
    method __init__ (line 228) | def __init__(self,
    method tokenize (line 239) | def tokenize(self, text):
    method _run_strip_accents (line 260) | def _run_strip_accents(self, text):
    method _run_split_on_punc (line 271) | def _run_split_on_punc(self, text):
    method _tokenize_chinese_chars (line 293) | def _tokenize_chinese_chars(self, text):
    method _is_chinese_char (line 306) | def _is_chinese_char(self, cp):
    method _clean_text (line 328) | def _clean_text(self, text):
  class WordpieceTokenizer (line 342) | class WordpieceTokenizer(object):
    method __init__ (line 345) | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=...
    method tokenize (line 350) | def tokenize(self, text):
  function _is_whitespace (line 402) | def _is_whitespace(char):
  function _is_control (line 414) | def _is_control(char):
  function _is_punctuation (line 426) | def _is_punctuation(char):

FILE: pytorch_pretrained_bert/tokenization_gpt2.py
  function lru_cache (line 31) | def lru_cache():
  function bytes_to_unicode (line 54) | def bytes_to_unicode():
  function get_pairs (line 76) | def get_pairs(word):
  class GPT2Tokenizer (line 88) | class GPT2Tokenizer(object):
    method from_pretrained (line 94) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
    method __init__ (line 151) | def __init__(self, vocab_file, merges_file, errors='replace', special_...
    method __len__ (line 170) | def __len__(self):
    method set_special_tokens (line 173) | def set_special_tokens(self, special_tokens):
    method bpe (line 186) | def bpe(self, token):
    method tokenize (line 227) | def tokenize(self, text):
    method convert_tokens_to_ids (line 238) | def convert_tokens_to_ids(self, tokens):
    method convert_ids_to_tokens (line 259) | def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
    method encode (line 270) | def encode(self, text):
    method decode (line 273) | def decode(self, tokens, skip_special_tokens=False, clean_up_tokenizat...
    method save_vocabulary (line 283) | def save_vocabulary(self, vocab_path):

FILE: pytorch_pretrained_bert/tokenization_openai.py
  function get_pairs (line 46) | def get_pairs(word):
  function text_standardize (line 58) | def text_standardize(text):
  class OpenAIGPTTokenizer (line 73) | class OpenAIGPTTokenizer(object):
    method from_pretrained (line 82) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
    method __init__ (line 139) | def __init__(self, vocab_file, merges_file, special_tokens=None, max_l...
    method __len__ (line 162) | def __len__(self):
    method set_special_tokens (line 165) | def set_special_tokens(self, special_tokens):
    method bpe (line 181) | def bpe(self, token):
    method tokenize (line 224) | def tokenize(self, text):
    method convert_tokens_to_ids (line 239) | def convert_tokens_to_ids(self, tokens):
    method convert_ids_to_tokens (line 260) | def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
    method encode (line 271) | def encode(self, text):
    method decode (line 274) | def decode(self, ids, skip_special_tokens=False, clean_up_tokenization...
    method save_vocabulary (line 285) | def save_vocabulary(self, vocab_path):

FILE: pytorch_pretrained_bert/tokenization_transfo_xl.py
  class TransfoXLTokenizer (line 53) | class TransfoXLTokenizer(object):
    method from_pretrained (line 58) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
    method __init__ (line 101) | def __init__(self, special=[], min_freq=0, max_size=None, lower_case=F...
    method count_file (line 112) | def count_file(self, path, verbose=False, add_eos=False):
    method count_sents (line 127) | def count_sents(self, sents, verbose=False):
    method _build_from_file (line 137) | def _build_from_file(self, vocab_file):
    method save_vocabulary (line 152) | def save_vocabulary(self, vocab_path):
    method build_vocab (line 160) | def build_vocab(self):
    method encode_file (line 181) | def encode_file(self, path, ordered=False, verbose=False, add_eos=True,
    method encode_sents (line 199) | def encode_sents(self, sents, ordered=False, verbose=False):
    method add_special (line 212) | def add_special(self, sym):
    method add_symbol (line 218) | def add_symbol(self, sym):
    method get_sym (line 223) | def get_sym(self, idx):
    method get_idx (line 227) | def get_idx(self, sym):
    method convert_ids_to_tokens (line 243) | def convert_ids_to_tokens(self, indices):
    method convert_tokens_to_ids (line 247) | def convert_tokens_to_ids(self, symbols):
    method convert_to_tensor (line 251) | def convert_to_tensor(self, symbols):
    method decode (line 254) | def decode(self, indices, exclude=None):
    method __len__ (line 261) | def __len__(self):
    method tokenize (line 264) | def tokenize(self, line, add_eos=False, add_double_eos=False):
  class LMOrderedIterator (line 284) | class LMOrderedIterator(object):
    method __init__ (line 285) | def __init__(self, data, bsz, bptt, device='cpu', ext_len=None):
    method get_batch (line 307) | def get_batch(self, i, bptt=None):
    method get_fixlen_iter (line 322) | def get_fixlen_iter(self, start=0):
    method get_varlen_iter (line 326) | def get_varlen_iter(self, start=0, std=5, min_len=5, max_deviation=3):
    method __iter__ (line 338) | def __iter__(self):
  class LMShuffledIterator (line 342) | class LMShuffledIterator(object):
    method __init__ (line 343) | def __init__(self, data, bsz, bptt, device='cpu', ext_len=None, shuffl...
    method get_sent_stream (line 356) | def get_sent_stream(self):
    method stream_iterator (line 365) | def stream_iterator(self, sent_stream):
    method __iter__ (line 414) | def __iter__(self):
  class LMMultiFileIterator (line 422) | class LMMultiFileIterator(LMShuffledIterator):
    method __init__ (line 423) | def __init__(self, paths, vocab, bsz, bptt, device='cpu', ext_len=None,
    method get_sent_stream (line 436) | def get_sent_stream(self, path):
    method __iter__ (line 444) | def __iter__(self):
  class TransfoXLCorpus (line 455) | class TransfoXLCorpus(object):
    method from_pretrained (line 457) | def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None...
    method __init__ (line 499) | def __init__(self, *args, **kwargs):
    method build_corpus (line 506) | def build_corpus(self, path, dataset):
    method get_iterator (line 545) | def get_iterator(self, split, *args, **kwargs):
  function get_lm_corpus (line 562) | def get_lm_corpus(datadir, dataset):

FILE: tests/conftest.py
  function pytest_addoption (line 6) | def pytest_addoption(parser):
  function pytest_collection_modifyitems (line 12) | def pytest_collection_modifyitems(config, items):

FILE: tests/modeling_gpt2_test.py
  class GPT2ModelTest (line 32) | class GPT2ModelTest(unittest.TestCase):
    class GPT2ModelTester (line 33) | class GPT2ModelTester(object):
      method __init__ (line 35) | def __init__(self,
      method prepare_config_and_inputs (line 73) | def prepare_config_and_inputs(self):
      method create_gpt2_model (line 106) | def create_gpt2_model(self, config, input_ids, token_type_ids, posit...
      method check_gpt2_model_output (line 117) | def check_gpt2_model_output(self, result):
      method create_gpt2_lm_head (line 124) | def create_gpt2_lm_head(self, config, input_ids, token_type_ids, pos...
      method create_gpt2_lm_head_with_output_attention (line 137) | def create_gpt2_lm_head_with_output_attention(self, config, input_id...
      method check_gpt2_lm_head_output (line 151) | def check_gpt2_lm_head_output(self, result):
      method check_gpt2_lm_head_loss_output (line 161) | def check_gpt2_lm_head_loss_output(self, result):
      method create_gpt2_double_heads (line 166) | def create_gpt2_double_heads(self, config, input_ids, token_type_ids...
      method create_gpt2_double_heads_with_output_attention (line 182) | def create_gpt2_double_heads_with_output_attention(self, config, inp...
      method check_gpt2_double_heads_output (line 199) | def check_gpt2_double_heads_output(self, result):
      method check_gpt2_double_heads_loss_output (line 208) | def check_gpt2_double_heads_loss_output(self, result):
      method create_and_check_gpt2_for_headmasking (line 213) | def create_and_check_gpt2_for_headmasking(self, config, input_ids, t...
      method create_and_check_gpt2_for_head_pruning (line 268) | def create_and_check_gpt2_for_head_pruning(self, config, input_ids, ...
    method test_default (line 305) | def test_default(self):
    method test_config_to_json_string (line 308) | def test_config_to_json_string(self):
    method test_config_to_json_file (line 314) | def test_config_to_json_file(self):
    method test_model_from_pretrained (line 323) | def test_model_from_pretrained(self):
    method run_tester (line 330) | def run_tester(self, tester):
    method ids_tensor (line 347) | def ids_tensor(cls, shape, vocab_size, rng=None, name=None):

FILE: tests/modeling_openai_test.py
  class OpenAIGPTModelTest (line 32) | class OpenAIGPTModelTest(unittest.TestCase):
    class OpenAIGPTModelTester (line 33) | class OpenAIGPTModelTester(object):
      method __init__ (line 35) | def __init__(self,
      method prepare_config_and_inputs (line 81) | def prepare_config_and_inputs(self):
      method create_openai_model (line 117) | def create_openai_model(self, config, input_ids, token_type_ids, pos...
      method check_openai_model_output (line 127) | def check_openai_model_output(self, result):
      method create_openai_lm_head (line 134) | def create_openai_lm_head(self, config, input_ids, token_type_ids, p...
      method check_openai_lm_head_output (line 146) | def check_openai_lm_head_output(self, result):
      method check_openai_lm_head_loss_output (line 152) | def check_openai_lm_head_loss_output(self, result):
      method create_openai_double_heads (line 157) | def create_openai_double_heads(self, config, input_ids, token_type_i...
      method check_openai_double_heads_output (line 172) | def check_openai_double_heads_output(self, result):
      method check_openai_double_heads_loss_output (line 181) | def check_openai_double_heads_loss_output(self, result):
      method create_and_check_openai_for_headmasking (line 186) | def create_and_check_openai_for_headmasking(self, config, input_ids,...
      method create_and_check_openai_for_head_pruning (line 242) | def create_and_check_openai_for_head_pruning(self, config, input_ids...
    method test_default (line 279) | def test_default(self):
    method test_config_to_json_string (line 282) | def test_config_to_json_string(self):
    method test_config_to_json_file (line 288) | def test_config_to_json_file(self):
    method test_model_from_pretrained (line 297) | def test_model_from_pretrained(self):
    method run_tester (line 304) | def run_tester(self, tester):
    method ids_tensor (line 321) | def ids_tensor(cls, shape, vocab_size, rng=None, name=None):

FILE: tests/modeling_test.py
  class BertModelTest (line 35) | class BertModelTest(unittest.TestCase):
    class BertModelTester (line 36) | class BertModelTester(object):
      method __init__ (line 38) | def __init__(self,
      method prepare_config_and_inputs (line 84) | def prepare_config_and_inputs(self):
      method check_loss_output (line 118) | def check_loss_output(self, result):
      method create_bert_model (line 123) | def create_bert_model(self, config, input_ids, token_type_ids, input...
      method check_bert_model_output (line 134) | def check_bert_model_output(self, result):
      method create_bert_for_masked_lm (line 144) | def create_bert_for_masked_lm(self, config, input_ids, token_type_id...
      method check_bert_for_masked_lm_output (line 155) | def check_bert_for_masked_lm_output(self, result):
      method create_bert_for_next_sequence_prediction (line 160) | def create_bert_for_next_sequence_prediction(self, config, input_ids...
      method check_bert_for_next_sequence_prediction_output (line 171) | def check_bert_for_next_sequence_prediction_output(self, result):
      method create_bert_for_pretraining (line 177) | def create_bert_for_pretraining(self, config, input_ids, token_type_...
      method check_bert_for_pretraining_output (line 189) | def check_bert_for_pretraining_output(self, result):
      method create_bert_for_question_answering (line 198) | def create_bert_for_question_answering(self, config, input_ids, toke...
      method check_bert_for_question_answering_output (line 210) | def check_bert_for_question_answering_output(self, result):
      method create_bert_for_sequence_classification (line 219) | def create_bert_for_sequence_classification(self, config, input_ids,...
      method check_bert_for_sequence_classification_output (line 230) | def check_bert_for_sequence_classification_output(self, result):
      method create_bert_for_token_classification (line 236) | def create_bert_for_token_classification(self, config, input_ids, to...
      method check_bert_for_token_classification_output (line 247) | def check_bert_for_token_classification_output(self, result):
      method create_bert_for_multiple_choice (line 253) | def create_bert_for_multiple_choice(self, config, input_ids, token_t...
      method check_bert_for_multiple_choice (line 272) | def check_bert_for_multiple_choice(self, result):
      method create_and_check_bert_for_attentions (line 278) | def create_and_check_bert_for_attentions(self, config, input_ids, to...
      method create_and_check_bert_for_headmasking (line 296) | def create_and_check_bert_for_headmasking(self, config, input_ids, t...
      method create_and_check_bert_for_head_pruning (line 356) | def create_and_check_bert_for_head_pruning(self, config, input_ids, ...
    method test_default (line 397) | def test_default(self):
    method test_config_to_json_string (line 400) | def test_config_to_json_string(self):
    method test_config_to_json_file (line 406) | def test_config_to_json_file(self):
    method test_model_from_pretrained (line 415) | def test_model_from_pretrained(self):
    method run_tester (line 422) | def run_tester(self, tester):
    method ids_tensor (line 460) | def ids_tensor(cls, shape, vocab_size, rng=None, name=None):

FILE: tests/modeling_transfo_xl_test.py
  class TransfoXLModelTest (line 31) | class TransfoXLModelTest(unittest.TestCase):
    class TransfoXLModelTester (line 32) | class TransfoXLModelTester(object):
      method __init__ (line 34) | def __init__(self,
      method prepare_config_and_inputs (line 72) | def prepare_config_and_inputs(self):
      method set_seed (line 95) | def set_seed(self):
      method create_transfo_xl_model (line 99) | def create_transfo_xl_model(self, config, input_ids_1, input_ids_2, ...
      method check_transfo_xl_model_output (line 113) | def check_transfo_xl_model_output(self, result):
      method create_transfo_xl_lm_head (line 128) | def create_transfo_xl_lm_head(self, config, input_ids_1, input_ids_2...
      method check_transfo_xl_lm_head_output (line 150) | def check_transfo_xl_lm_head_output(self, result):
    method test_default (line 183) | def test_default(self):
    method test_config_to_json_string (line 186) | def test_config_to_json_string(self):
    method test_config_to_json_file (line 192) | def test_config_to_json_file(self):
    method test_model_from_pretrained (line 201) | def test_model_from_pretrained(self):
    method run_tester (line 208) | def run_tester(self, tester):
    method ids_tensor (line 220) | def ids_tensor(cls, shape, vocab_size, rng=None, name=None):

FILE: tests/optimization_test.py
  class OptimizationTest (line 30) | class OptimizationTest(unittest.TestCase):
    method assertListAlmostEqual (line 32) | def assertListAlmostEqual(self, list1, list2, tol):
    method test_adam (line 37) | def test_adam(self):
  class ScheduleInitTest (line 54) | class ScheduleInitTest(unittest.TestCase):
    method test_bert_sched_init (line 55) | def test_bert_sched_init(self):
    method test_openai_sched_init (line 65) | def test_openai_sched_init(self):
  class WarmupCosineWithRestartsTest (line 76) | class WarmupCosineWithRestartsTest(unittest.TestCase):
    method test_it (line 77) | def test_it(self):

FILE: tests/tokenization_gpt2_test.py
  class GPT2TokenizationTest (line 26) | class GPT2TokenizationTest(unittest.TestCase):
    method test_full_tokenizer (line 28) | def test_full_tokenizer(self):
    method test_tokenizer_from_pretrained (line 69) | def test_tokenizer_from_pretrained(self):

FILE: tests/tokenization_openai_test.py
  class OpenAIGPTTokenizationTest (line 26) | class OpenAIGPTTokenizationTest(unittest.TestCase):
    method test_full_tokenizer (line 28) | def test_full_tokenizer(self):
    method test_tokenizer_from_pretrained (line 70) | def test_tokenizer_from_pretrained(self):

FILE: tests/tokenization_test.py
  class TokenizationTest (line 30) | class TokenizationTest(unittest.TestCase):
    method test_full_tokenizer (line 32) | def test_full_tokenizer(self):
    method test_tokenizer_from_pretrained (line 62) | def test_tokenizer_from_pretrained(self):
    method test_chinese (line 69) | def test_chinese(self):
    method test_basic_tokenizer_lower (line 76) | def test_basic_tokenizer_lower(self):
    method test_basic_tokenizer_no_lower (line 84) | def test_basic_tokenizer_no_lower(self):
    method test_wordpiece_tokenizer (line 91) | def test_wordpiece_tokenizer(self):
    method test_is_whitespace (line 111) | def test_is_whitespace(self):
    method test_is_control (line 121) | def test_is_control(self):
    method test_is_punctuation (line 129) | def test_is_punctuation(self):

FILE: tests/tokenization_transfo_xl_test.py
  class TransfoXLTokenizationTest (line 26) | class TransfoXLTokenizationTest(unittest.TestCase):
    method test_full_tokenizer (line 28) | def test_full_tokenizer(self):
    method test_full_tokenizer_lower (line 57) | def test_full_tokenizer_lower(self):
    method test_full_tokenizer_no_lower (line 64) | def test_full_tokenizer_no_lower(self):
    method test_tokenizer_from_pretrained (line 72) | def test_tokenizer_from_pretrained(self):
Condensed preview — 65 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (1,500K chars).
[
  {
    "path": "LICENSE",
    "chars": 11358,
    "preview": "\n                                 Apache License\n                           Version 2.0, January 2004\n                  "
  },
  {
    "path": "MANIFEST.in",
    "chars": 16,
    "preview": "include LICENSE\n"
  },
  {
    "path": "README.md",
    "chars": 1870,
    "preview": "### 实现说明\n\n主要实现文章前半部分的工作,PyTorch实现,基于[huggingface](https://github.com/huggingface/pytorch-pretrained-BERT)的工作,PyTorch才是世界"
  },
  {
    "path": "docker/Dockerfile",
    "chars": 197,
    "preview": "FROM pytorch/pytorch:latest\n\nRUN git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cu"
  },
  {
    "path": "examples/bertology.py",
    "chars": 17149,
    "preview": "#!/usr/bin/env python3\nimport os\nimport argparse\nimport logging\nfrom datetime import timedelta, datetime\nfrom tqdm impor"
  },
  {
    "path": "examples/extract_features.py",
    "chars": 12208,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under th"
  },
  {
    "path": "examples/lm_finetuning/README.md",
    "chars": 6210,
    "preview": "# BERT Model Finetuning using Masked Language Modeling objective\n\n## Introduction\n\nThe three example scripts in this fol"
  },
  {
    "path": "examples/lm_finetuning/finetune_on_pregenerated.py",
    "chars": 16453,
    "preview": "from argparse import ArgumentParser\nfrom pathlib import Path\nimport os\nimport torch\nimport logging\nimport json\nimport ra"
  },
  {
    "path": "examples/lm_finetuning/pregenerate_training_data.py",
    "chars": 16270,
    "preview": "from argparse import ArgumentParser\nfrom pathlib import Path\nfrom tqdm import tqdm, trange\nfrom tempfile import Temporar"
  },
  {
    "path": "examples/lm_finetuning/simple_lm_finetuning.py",
    "chars": 28381,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
  },
  {
    "path": "examples/run_classifier.py",
    "chars": 51160,
    "preview": "#coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, "
  },
  {
    "path": "examples/run_classifier_dataset_utils.py",
    "chars": 19787,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
  },
  {
    "path": "examples/run_gpt2.py",
    "chars": 5222,
    "preview": "#!/usr/bin/env python3\n\nimport argparse\nimport logging\nfrom tqdm import trange\n\nimport torch\nimport torch.nn.functional "
  },
  {
    "path": "examples/run_openai_gpt.py",
    "chars": 13653,
    "preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
  },
  {
    "path": "examples/run_squad.py",
    "chars": 21799,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
  },
  {
    "path": "examples/run_squad_dataset_utils.py",
    "chars": 30976,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
  },
  {
    "path": "examples/run_swag.py",
    "chars": 24323,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
  },
  {
    "path": "examples/run_transfo_xl.py",
    "chars": 6735,
    "preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
  },
  {
    "path": "examples/sem_run_classifier.py",
    "chars": 51160,
    "preview": "#coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, "
  },
  {
    "path": "examples/tacred_run_classifier.py",
    "chars": 50886,
    "preview": "#coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, "
  },
  {
    "path": "examples/tacred_run_infer.py",
    "chars": 23695,
    "preview": "from __future__ import absolute_import, division, print_function\n\nimport argparse\nimport csv\nimport logging\nimport os\nim"
  },
  {
    "path": "examples/test.sh",
    "chars": 492,
    "preview": "#export GLUE_DIR=/data/share/zhanghaipeng/pytorch-pretrained-BERT/examples/general_ner_test\nexport GLUE_DIR=/data/share/"
  },
  {
    "path": "examples/train.sh",
    "chars": 461,
    "preview": "export GLUE_DIR=/data/share/zhanghaipeng/tre/datasets/data\nexport TASK_NAME=tacred\n\nEXPR=25\nBS=16\nCUDA=2\nLR=3e-5\nEPOCH=4"
  },
  {
    "path": "hubconf.py",
    "chars": 723,
    "preview": "dependencies = ['torch', 'tqdm', 'boto3', 'requests', 'regex']\n\nfrom hubconfs.bert_hubconf import (\n    bertTokenizer,\n "
  },
  {
    "path": "hubconfs/bert_hubconf.py",
    "chars": 17306,
    "preview": "from pytorch_pretrained_bert.tokenization import BertTokenizer\nfrom pytorch_pretrained_bert.modeling import (\n        Be"
  },
  {
    "path": "hubconfs/gpt2_hubconf.py",
    "chars": 7052,
    "preview": "from pytorch_pretrained_bert.tokenization_gpt2 import GPT2Tokenizer\nfrom pytorch_pretrained_bert.modeling_gpt2 import (\n"
  },
  {
    "path": "hubconfs/gpt_hubconf.py",
    "chars": 8281,
    "preview": "from pytorch_pretrained_bert.tokenization_openai import OpenAIGPTTokenizer\nfrom pytorch_pretrained_bert.modeling_openai "
  },
  {
    "path": "hubconfs/transformer_xl_hubconf.py",
    "chars": 5856,
    "preview": "from pytorch_pretrained_bert.tokenization_transfo_xl import TransfoXLTokenizer\nfrom pytorch_pretrained_bert.modeling_tra"
  },
  {
    "path": "notebooks/Comparing-PT-and-TF-models.ipynb",
    "chars": 92238,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Pytorch to Tensorflow Conversion "
  },
  {
    "path": "notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb",
    "chars": 173162,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Comparing TensorFlow (original) a"
  },
  {
    "path": "notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb",
    "chars": 207537,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Comparing TensorFlow (original) a"
  },
  {
    "path": "notebooks/Comparing-TF-and-PT-models.ipynb",
    "chars": 62623,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Comparing TensorFlow (original) a"
  },
  {
    "path": "pytorch_pretrained_bert/__init__.py",
    "chars": 1337,
    "preview": "__version__ = \"0.6.2\"\nfrom .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer\nfrom .tokenization_ope"
  },
  {
    "path": "pytorch_pretrained_bert/__main__.py",
    "chars": 4393,
    "preview": "# coding: utf8\ndef main():\n    import sys\n    if (len(sys.argv) != 4 and len(sys.argv) != 5) or sys.argv[1] not in [\n   "
  },
  {
    "path": "pytorch_pretrained_bert/convert_gpt2_checkpoint_to_pytorch.py",
    "chars": 3017,
    "preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
  },
  {
    "path": "pytorch_pretrained_bert/convert_openai_checkpoint_to_pytorch.py",
    "chars": 3106,
    "preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
  },
  {
    "path": "pytorch_pretrained_bert/convert_pytorch_checkpoint_to_tf.py",
    "chars": 4343,
    "preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
  },
  {
    "path": "pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py",
    "chars": 2593,
    "preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
  },
  {
    "path": "pytorch_pretrained_bert/convert_transfo_xl_checkpoint_to_pytorch.py",
    "chars": 5671,
    "preview": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
  },
  {
    "path": "pytorch_pretrained_bert/file_utils.py",
    "chars": 9347,
    "preview": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github"
  },
  {
    "path": "pytorch_pretrained_bert/modeling.py",
    "chars": 66537,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018,"
  },
  {
    "path": "pytorch_pretrained_bert/modeling_gpt2.py",
    "chars": 45614,
    "preview": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORAT"
  },
  {
    "path": "pytorch_pretrained_bert/modeling_openai.py",
    "chars": 46459,
    "preview": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORAT"
  },
  {
    "path": "pytorch_pretrained_bert/modeling_transfo_xl.py",
    "chars": 60075,
    "preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
  },
  {
    "path": "pytorch_pretrained_bert/modeling_transfo_xl_utilities.py",
    "chars": 16108,
    "preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
  },
  {
    "path": "pytorch_pretrained_bert/optimization.py",
    "chars": 13047,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under th"
  },
  {
    "path": "pytorch_pretrained_bert/optimization_openai.py",
    "chars": 5558,
    "preview": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache Li"
  },
  {
    "path": "pytorch_pretrained_bert/tokenization.py",
    "chars": 18201,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under th"
  },
  {
    "path": "pytorch_pretrained_bert/tokenization_gpt2.py",
    "chars": 14181,
    "preview": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache Li"
  },
  {
    "path": "pytorch_pretrained_bert/tokenization_openai.py",
    "chars": 14189,
    "preview": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache Li"
  },
  {
    "path": "pytorch_pretrained_bert/tokenization_transfo_xl.py",
    "chars": 22339,
    "preview": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. "
  },
  {
    "path": "requirements.txt",
    "chars": 196,
    "preview": "# PyTorch\ntorch>=0.4.1\n# progress bars in model download and training scripts\ntqdm\n# Accessing files from S3 directly.\nb"
  },
  {
    "path": "samples/input.txt",
    "chars": 52,
    "preview": "Who was Jim Henson ? ||| Jim Henson was a puppeteer\n"
  },
  {
    "path": "samples/sample_text.txt",
    "chars": 4364,
    "preview": "This text is included to make sure Unicode is handled properly: 力加勝北区ᴵᴺᵀᵃছজটডণত\nText should be one-sentence-per-line, wi"
  },
  {
    "path": "setup.py",
    "chars": 2798,
    "preview": "\"\"\"\nSimple check list from AllenNLP repo: https://github.com/allenai/allennlp/blob/master/setup.py\n\nTo create the packag"
  },
  {
    "path": "tests/conftest.py",
    "chars": 511,
    "preview": "# content of conftest.py\n\nimport pytest\n\n\ndef pytest_addoption(parser):\n    parser.addoption(\n        \"--runslow\", actio"
  },
  {
    "path": "tests/modeling_gpt2_test.py",
    "chars": 16770,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "tests/modeling_openai_test.py",
    "chars": 15409,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "tests/modeling_test.py",
    "chars": 23337,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "tests/modeling_transfo_xl_test.py",
    "chars": 9474,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "tests/optimization_test.py",
    "chars": 3927,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "tests/tokenization_gpt2_test.py",
    "chars": 3124,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "tests/tokenization_openai_test.py",
    "chars": 3222,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "tests/tokenization_test.py",
    "chars": 5090,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "tests/tokenization_transfo_xl_test.py",
    "chars": 2998,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  }
]

About this extraction

This page contains the full source code of the zhpmatrix/BERTem GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 65 files (1.4 MB), approximately 420.2k tokens, and a symbol index with 925 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Copied to clipboard!